DEV Community: Gao Dalie (Ilyass)

How To Build a Claude Loop Engineering Better Than 99% of People

Gao Dalie (Ilyass) — Mon, 15 Jun 2026 22:55:45 +0000

I’m no longer prompting Claude. I’m just running a loop that prompts him and then thinks about what to do next. My job is to write loops.

— Boris Cherny, Head of Anthropic Claude Code, June 2026

In 2024, the key to success was “writing good prompts.” 2025 was the era of running multiple AI agents in parallel. And in June 2026, another evolution will occur in how we interact with AI.

Recently, the term “loop” has become quite a hot topic in the AI community.

The whole thing started with a post by Peter, the creator of OpenClaw.

In the world of AI agents, the term Loop Engineering is suddenly gaining attention.

Humans would enter prompts each time, read the results, and then enter the next prompt. In most cases of AI utilization up until now, humans were trapped in a loop.

AI was the worker.
Humans were the directors, supervisors, and inspectors.

This is where Loop Engineering reversed the situation.

Instead of a human having to enter prompts each time, the task is assigned to the AI, executed, the results are reviewed, and if it’s not finished yet, it’s executed again. This entire iterative process is designed as a system.

In other words, humans will shift from being “people who type prompts” to “people who create systems for typing prompts.”

Full Article in My [substack ](https://substack.com/home/post/p-202121644

I Built a 10x Cheaper Enterprise Gemini Multimodal RAG File Search

Gao Dalie (Ilyass) — Sun, 17 May 2026 13:54:45 +0000

This week in the AI industry is one you can’t afford to miss. In just seven days, four players — Anthropic, OpenAI, Google, and the US government — made major moves at the same time.

Google has announced a powerful update that further expands the functionality of the Gemini API. The highlights of this update are enhanced support for “File Search” and “Multimodal RAG”.

The Gemini API’s File Search has finally evolved into a RAG that can search for images as well.

The point isn’t that Gemini can now understand images; it could do that before. What’s interesting now is that the managed File Search store for RAG now includes image embedding, searching, metadata filtering, and citations.

“Retrieval-Augmented Generation,” a system where AI generates answers by referencing internal company documents, manuals, and databases. Previously, it only supported “text,” but with this update, it can now search documents containing “image information,” such as PDF diagrams, UI screenshots, product images, and graphs, and understand their meaning.

Actual business documents are not “just text.” Reports with charts, manuals with screenshots, and design documents with diagrams — these have finally become “true searchable items” for AI.

So, let me give you a quick demo of the live chatbot to show you how everything works.

Link to Demo: https://www.youtube.com/watch?v=a2qjEB14Rc8

I searched Google for a financial PDF that contains a complex layout with tables, charts, and graphs.

I ran Python code and uploaded the PDF. The app supports uploading multiple documents. After uploading, I clicked “Upload & index.”

When this happens, the agent first checks whether the filename already exists in the store to avoid duplicates. If it is new, the file is sent to a thread pool inside the agent environment.

The agent upload (which exists only in memory) is written to a temporary file on disk because the Gemini SDK requires a real file path. The agent then detects the file’s Multipurpose Internet Mail Extension type from its extension and attempts to upload it using upload_to_file_search_store().

If the direct upload fails (for example, if the API does not support that file type in that path), the agent switches to a fallback process.

In both cases, the result is an Operation object which represents an asynchronous backend process. The agent then enters a polling loop: it checksoperation.done, starting with a short delay (2 seconds) and gradually increasing the wait time to avoid excessive API calls.

Once the process finishes, the agent checks for operation.error. If something goes wrong, it displays the error code and message to the user. If successful, it fetches the indexed document and verifies that its state is ACTIVE, ensuring it is fully searchable.

When the user asks a question, the agent wraps it in a type.Content(role="user") object and calls generate_content() with the File Search tool enabled and connected to the store.

Then, the agent retrieves the most relevant chunks from the indexed documents, injects them into the prompt context, and generates a grounded response and checks for safety issues or empty outputs.

If citation_metadata is included, the agent loops through the citations and displays them as clickable references back to the original source documents.

What is File Search?

For those of you wondering what File Search is, it’s a RAG tool built into the Gemini API. fileSearchStoresIt creates a repository where you can put your documents, and generateContent when you call it, it automatically " picks up only the relevant parts and passes them to the model ." Its biggest selling point is that you don't need to manage your own vector database.

Three key features of the new function

According to the official blog, the core of this upgrade consists of three things:

True Native Multimodal File Search In the past, File Search was a pure text search; images could only be entered into the store by being converted into text using OCR.

“File Search now processes images and text together. Powered by the Gemini Embedding 2 model, the tool understands native image data.”

Now you can directly upload images to the File Search Store and have them indexed along with text.

The system runs on Gemini Embedding 2, where text, images, videos, audio, and files all share the same vector space.

This means you can search across different types of content without manually connecting them.

For example, you can find text using an image, find an image using text, or even find similar images using another image.

For us product developers, this means:

Text-and-image mixed search is no longer a research topic but an API call.

There is no need to maintain two stores (one for text chunks and one for CLIP-style image embeddings).
Scientific charts, UI screenshots, reports, photo albums… things that previously lost most of their semantic meaning after OCR can now be retrieved while retaining their original visual information.

Custom Metadata and Server-side Filtering

Every file you put into the store can now be tagged with a key-value pair:

{"key": "user_id", "string_value": "U1234abcd..."}
{"key": "department", "string_value": "Legal"}
{"key": "status", "string_value": "Final"}

When querying, use the google.aip.dev/160 filter syntax (the same format as most GCP list APIs):

metadata_filter='department="Legal" AND status="Final"'
The filtering was done on Google’s end first, instead of scooping up a bunch of data and then discarding it. With less noise, both speed and accuracy will increase, which is a lifesaver for multi-tenant SaaS — a single store with a metadata filter can switch tenants without having to isolate multiple stores.

My WhatsApp bot uses this method to isolate per-user data: each file is uploaded with WhatsApp information user_id A filter is applied during queries, so user A will never see user B's information in the Q&A section.

Page-level citations

Each quoted excerpt in the response will now be page-numbered.

“captures the page number for every piece of indexed information.”

This is crucial for enterprise clients. “AI tells me Y is mentioned on page X of the contract” vs. “AI tells me Y is mentioned in the contract”

The first answer is much easier for legal and auditing teams to trust because they can quickly verify the source.

The second still needs people to manually search through documents, which takes time and effort.

Page numbers unlock the final mile of the “LLM answer cannot be traced back to its source.”

What’s so great about it?

This is the most important part. I’ll summarise what will change in a table.

What’s particularly amazing is that you can now embed images directly. Until now, it was common to build your own preprocessing pipeline like “PDF charts → OCR → text embedding,” but that’s no longer necessary. Information that’s difficult to convert to text, such as slides, diagrams, photos, and UI captures, can now be included in search results.

Furthermore, the billing model is also user-friendly; storage and embedding during queries are free, and you are only charged for “ embedding during indexing “ and “ context tokens consumed by retrieved documents .”

This means you can operate it with only the cost of loading. The biggest expense in RAG operations is usually the cost of embedding, so the design that confines that to a “one-time initial charge” is quite considerate.

It supports a wide range of models, including Gemini 3.1 Pro Preview, Gemini 3.1 Flash-Lite, Gemini 3 Flash Preview, and Gemini 2.5 Pro/Flash-Lite. The ability to share the same storage between the Pro and Flash-Lite versions is very convenient.

Continue Reading the Full Article: Link

OpenClaw 5.2 + WhatsApp + ClawHub: FASTEST Way to Install Locally

Gao Dalie (Ilyass) — Fri, 15 May 2026 14:52:08 +0000

Over the past few months, one name has been hard to ignore in the AI — OpenClaw.

This project has gained explosive attention in a very short time: hundreds of thousands of stars, very high token usage, and rapid follow-up from almost all major companies.

I’ve never been this tired from keeping up with the latest version. The OpenClaw team seems to have no need for rest at all.

It’s rare to see an AI framework that uses the update date as its version number; the famous Lobster is one of them.

Five days, five version numbers — from v2026.4.29 to v2026.5.2, the open-source personal AI assistant OpenClaw has completed a round of intensive iterations on GitHub that can be described as “going all out”.

Memory system reconstruction, security hardening, multi-platform channel repair, video generation access, local voice inference… Almost every version is packed with heavyweight updates.

OpenClaw isn’t like an AI that gives you a recipe; it’s an AI that actually stands in the kitchen and cooks for you. It uses a Large-Scale Language Model (LLM) as its “brain” and messaging apps as its “hands and feet” to autonomously perform tasks on your PC.

Link to Full Video: https://www.youtube.com/watch?v=rhXFjTADP58&t=322s

In April 2026, OpenAI announced a new large-scale language model.

The newly introduced “ GPT-5.5 “ is not merely an extension of previous models, but a model developed under a new design philosophy of AI that can actually perform real work.

It can perform a series of tasks such as research, analysis, document creation, and tool operation more autonomously, and generate text and code.

Previous LLMs had the drawback that “ they could produce highly accurate output, but detailed human instructions were necessary to complete the entire task .”

On the other hand, GPT-5.5 can understand the intent behind even complex and ambiguous instructions and proceed to process them while planning its own actions.

Furthermore, it improves response speed and token efficiency while maintaining high reasoning capabilities, significantly enhancing its usability in practical applications.

What's New About Openclaw?

OpenClaw has returned to its familiar daily update schedule. This update is completely different from the previous ones that “fixed a few minor bugs”.

Active Memory Plugin: The most compelling new feature — the optional Active Memory plugin inserts a dedicated memory sub-agent before the main reply, automatically retrieving relevant preferences, context, and history details from ongoing conversations.

It supports three context modes (message /recent /full), real-time /verbose checks, and custom prompt and thinking parameters. Users no longer need to say “remember this manually” — the AI will do the searching itself.

This is a bug fix-focused release that resolves the startup crash issues that occurred in multiple channels within the npm build environment: Telegram and all bundled channels (BlueBubbles, Lark, Google Chat, IRC, Matrix, Mattermost, Teams, Nextcloud Talk, Slack, Zalo) are now uniformly loaded with the setup/secret contract via a pre-packaged top-level sidecar.

Bundled plugin compatibility metadata is also aligned with the release version. and fixes have been made for HTTP(S) proxy and NO_PROXY support in Slack Socket Mode, issues with web scraping guards skipping DNS pinning policies under trusted environment proxies, and alignment of Agent layer update_plan availability and /exec default reporting behavior in OpenAI family runtime environments.

Plugin ecosystem: The manifest can now declare activation and setup descriptors, allowing third-party plugin developers to describe the authentication, pairing, and configuration steps required for the installation process without having to write hard-coded logic in the core code.

Enhanced video generation: Supports URL-only asset delivery (no need to load large files in memory), typed providerOptions and reference audio input, adaptive aspect ratio, and higher image input limits.

Note: There are more updates, but I couldn’t cover all of them in one article.

What makes GPT-5.5 Unique?

GPT-5.5 has significantly improved its ability to quickly transform these “drafts” into a viable starting point. It’s not just about being able to write code; it has a stable ability to grasp the “workflow and intentions” and propose a framework.

Each AI has its pros and cons, but as of now (April 27, 2026), GPT seems to produce code with a structure that is less prone to corruption.

Instead of “AI doing everything,” we should “increase the number of processes handled by AI.”
There’s an expectation that “if AI can write code, does that mean humans no longer need to look at it?”, but the reality in practice is different.

Correcting the code, making minor adjustments, and finally doing a visual check — this isn’t “double work,” it’s “normal procedure.”

By reducing the time spent writing from scratch, drastically speeding up the revision process, and allowing humans to dedicate time to truly important aspects such as “the validity of branching” and “consistency with business assumptions,” the goal is not to “let AI do everything,” but rather to “increase the number of processes that AI takes on.”

This perspective fits best with current practical work.

The value of GPT-5.5 lies in the “process” rather than the “answer.”
The biggest takeaway from this presentation is that the GPT-5.5 is a model that “takes over the entire intermediate process.”

What truly takes time in the field isn’t the answer itself, but the process of organising, breaking down, testing, and verifying to arrive at it.

The same is true for AI agent development; simply speeding up the initial structuring and drafting process dramatically changes the overall speed of the project.

Considering this, the value of GPT-5.5 might lie not in “providing amazing answers,” but in “being able to take on more intermediate steps to move the work forward.”

How to Deploy OpenClaw On VPS?

In my last video, many of you were having trouble deploying OpenCLAW locally. So I did some quick research and found a one-click deployment solution.

In this video, I’ll show you step by step how to deploy OpenCLAW to a VPS, so you can chat with your assistant from anywhere. I’ll also show you how to start adding skills to expand what your assistant can do.

Let’s start the deployment process. To get the best deal on your VPS, make sure to sign up using my link. You’ll get an extra 10% off any Hostinger plan.

Once you click the link, you’ll be taken to the OpenCLAW deployment page.

Click on Start, and you will join me on the cart page to select your VPS plan. You should see our coupon code applied automatically. If not, just enter GAO manually.

First, choose your billing period. You can select 1 month, 12 months, or 24 months. For the best price, I recommend the 12-month plan.

Next, scroll down and enable daily auto backups. OpenClaw can reconfigure its own server environment, so if something breaks, having backups will help you quickly restore everything.

After that, choose the server location with the lowest latency. Unless you specifically need another region, I recommend keeping the default option.

Finally, click the Continue button.

Press enter or click to view the

Now that you’ve signed up for your VPS, you’ll land on the OpenCLAW configuration page after your payment is processed. Just a quick heads-up: this page changes often, so your screen may look a little different from mine.

The first thing you’ll see is your OpenCLAW gateway token. Click the eye icon to reveal it, then copy it and save it somewhere safe, like a password manager.

Do not share this token with anyone.

In this video, we’ll use the OpenAI API. Go to your OpenAI account, copy your API key, and paste it into the OpenAI API Key section. This will let you use multiple models with OpenCLAW.

Next, scroll down a little more. You can also enter your phone number once everything is set up correctly.

After that, click the Next button, review all the files, and then click Deploy. The deployment process usually takes around 5 minutes to finish.

Inside the Hostinger dashboard, you’ll see that we are under Docker Manager and Projects. This is how you can return to this screen later if needed.

Your project will start deploying automatically. Once it finishes, you should see a green check mark, and the status should say Running. You’ll also see that your API key is connected and OpenCLAW is running correctly.

On this screen, there are a few useful options. The Gateway Token button lets you quickly copy your token again. To open your dashboard, click the Open link.

When the dashboard opens, paste your gateway token into the login page and click Login.

That should take you to the Chat tab by default. Before we continue, there’s one more thing we should check to make sure everything is connected properly.

Click on the Overview tab and check the Gateway Connected status.

Now that we’ve accessed the OpenCLAW dashboard and confirmed that everything is connected, let’s send our first message.

Go to the Chat tab and send a simple message like “Hello.” This will start the onboarding process. The goal here is to give your assistant a basic foundation.

You’ll notice that OpenCLAW starts asking you questions. Go ahead and answer them naturally. Once you finish answering all the questions, OpenCLAW will save everything into a file called Soul.md.

Your assistant will remember this information in future conversations, helping it respond in a more personalised and useful way.

Now, let’s set up a way to message your assistant outside of the web dashboard. OpenCLAW supports several messaging platforms, including WhatsApp, Discord, Slack, and more.

In this tutorial, I’ll show you how to connect to WhatsApp because it’s one of the easiest platforms to set up.

Let’s ask OpenCLAW:

“Can you help me set up WhatsApp?”

Once OpenCLAW generates the QR code, pick up your phone and scan it. Then link your device to make sure everything is set up correctly.

After that, go to the WhatsApp section and click on the channel. You should see the connection status page.

Make sure all the connection health checks show “Yes.” If everything says connected, your WhatsApp integration is working properly.

Once everything is set up, feel free to start asking questions. For example, I’ll ask, “What’s the weather like in New York right now?” and you’ll see the response generated by OpenCLAW.

Now let’s talk about skills. Skills are basically what teach your assistant how to do new things. These can include connecting to Google Workspace, managing your calendar, or searching the web.

If you want to learn more about skills, you can check out my article, where I explain them in more detail.

The easiest way to install skills is to ask your bot directly. It will guide you through each step.

You can install anything you want, as long as you clearly name the skill you want to add.

If you get lost at any point, feel free to ask follow-up questions.

Conclsuion :

OpenClaw is not a technological breakthrough, but it marks a technological tipping point. From edge-cloud collaboration and memory architecture to edge-side evolution and the emergence of swarm intelligence, these aspects expose current problems in intelligent agents, and each level offers significant research opportunities.

Currently, we are at a crucial juncture in the transition of intelligent agent technology from “demonstration feasibility” to “large-scale deployment.”

Claude is Taking Over: Just solved session limits

Gao Dalie (Ilyass) — Sun, 10 May 2026 14:16:27 +0000

If you use Claude Code every day like I do to build apps, you’ve probably experienced this too.

Even before thinking about whether the most painful part is when it suddenly stops in the middle of your work. You could be writing code, doing deep research, or finishing the structure of an article, and then you hit a usage limit. At that moment, your concentration on the work is completely broken.

Two days ago, that change

Anthropic has announced a computing resources partnership with SpaceX.

In addition, usage limits for Claude Code have also been increased.
Most notably, the 5-hour usage limit for Claude Code has been doubled for Pro, Max, Team, and sheet-based Enterprise plans.

For those using Claude Code, this is simply a welcome change.

However, I think it would be a shame to simply end this news with the statement that “Claude Code has become more usable.”

The core issue is that competition between AI development tools is no longer only about how intelligent the models are.

No matter how advanced an AI is, if it hits limitations when you need it most, it will stop working in practical applications. On the other hand, an AI that can keep working reliably for a long time without interruption has real value.

The usability of AI tools is not determined only by the chat interface users see on the surface. The real experience also depends on the infrastructure behind it, including computing resources, data centres, power supply, and GPU availability.

Link to Video:

The usage restrictions for Claude Code have doubled.

There are three main points to what Anthropic announced this time.

The first change is doubling the 5-hour usage limit for Claude Code.
This applies to Pro, Max, Team, and sheet-based Enterprise plans.

The second change is the removal of peak-time restrictions on Claude Code for Pro and Max accounts. This change directly addresses complaints about stricter restrictions during peak hours.

The third point is to increase the API rate limits for Claude Opus models significantly. This is more relevant to developers and enterprise users than to individual users.

In other words, this change is not so much about making it “a little easier to use” as about relaxing restrictions for people who use Claude seriously.

Claude Code, in particular, is a tool that proves more valuable in longer-term projects than in one-off questions.

Have them read the code.
Have them investigate the cause.
Have them correct the code while looking at multiple files.
Have them isolate the error.
Have them think about the implementation strategy.
Have them check it again after making the corrections.

When used in this way, limitations cease to be just numbers.

Will they stop midway through the process? Or can they be entrusted with the entire workflow from start to finish? This is where experience makes all the difference.

The essence of the matter is not that “Claude has become smarter,” but rather that “the amount of information available has increased.”

In AI news, attention tends to focus on stories like “a new model has been released,” “benchmarks have improved,” or “answer accuracy has increased.”

Of course, the model’s performance is important.

However, when you actually use it in work or development, other problems become apparent.

The question is
How long can it be used continuously?

For example, let’s say you’re using AI to fix code.

First, the problem is explained.
The AI reads the code.
It guesses the cause.
It proposes a solution.
It actually corrects.
An error occurs.
It investigates again.
It looks at other files as well.
It makes another correction.

This process doesn’t end with just one question.

Rather, the value of AI development tools lies in their ability to move the work forward through repeated interaction.

If a limit is encountered along the way, the workflow will be interrupted.

From the human side, once the process stops, there are costs involved in restarting. You end up in a situation where you’re wondering
“Where did we leave off?” “What did we fix?”, and “What do we need to check next?”

Therefore, doubling the usage limit for Claude Code is not simply a change in the upper limit.

This change is quite significant for those who use AI as a “worker” rather than as a “consultant .”

The partnership with SpaceX demonstrates that computing resources have become a competitive advantage for AI companies.

What’s even more interesting about this announcement is that the relaxation of restrictions on the Claude Code is being discussed in conjunction with a computing resources partnership with SpaceX.

Anthropic has stated that it has signed a contract to utilize the computing power of SpaceX’s Colossus 1 data centre, where it will have access to over 300 megawatts of new capacity and more than 220,000 NVIDIA GPUs.

What’s important here is that for AI companies, computing resources are no longer just a behind-the-scenes tool.

The user experience of an AI service is not determined solely by the model.

The model must be intelligent.
The UI must be user-friendly.
It must have strong tool integration.
And it must have sufficient computing resources.

Only when all of these elements are present does it finally become an AI that can be used in practical applications.

In particular, with development support tools like Claude Code, the load increases as the number of users grows.

Asking short questions via chat is a relatively light use case.
However, having it read the entire codebase, work across multiple files, or maintain long contexts will significantly increase the burden on computing resources.

In other words, AI companies need to be not only “companies that create good models,” but also “companies that secure large amounts of computing resources.”

I think this is the most significant point of this news.

The next way of looking at AI development tools, which also connects to Codex and Cursor. This story isn’t just about Claude Code.

This also relates to Codex, Cursor, Windsurf, GitHub Copilot, and other AI development tools.

From now on, when looking at AI development tools, simply asking
“Which model is the smartest?”
will no longer suffice.

Rather, the following perspective is necessary.

How long can it continue working?
How many files can it handle?
Is it less likely to encounter limitations?
Is it stable even during peak hours?
Does it not interrupt the human workflow?
Can it handle everything from error isolation to correction?

This is quite important in practical work.

When comparing the performance of AI systems, our attention is often drawn to benchmarks.

However, from the perspective of someone actually using the tool, there are many situations where the more important question is,
“Can I complete this task right now?”

An AI that is somewhat intelligent but quickly hits its limits.
An AI that is a little slower but can continue working steadily.

Which one is easier to use for work depends on the situation.

I think Anthropic’s recent move will slightly change the criteria for evaluating AI development tools.

What’s important to the user is how much of the work they can continue to delegate.
This relaxation of restrictions should be especially beneficial for those who use Claude Code extensively.

Conversely, if you only use it occasionally to ask questions, you might not notice much of a difference.

However, for those who use AI to read, correct, and verify code, and even track down other problems, it is quite meaningful.

AI development tools are evolving from simply providing answers to becoming integral parts of the workflow itself.

In that case, the quality of the first answer isn’t the only important factor.

It must not stop midway.
It must maintain context.
It must be able to handle long tasks.
When a person returns, they must be able to understand where the work left off.

These factors contribute to actual usability.

The news that the usage limit for Claude Code has been doubled is, on the surface, a relaxation of the limits for users.

However, behind this lies the reality that AI companies are making significant moves to secure computing resources.

Competition in AI tools isn’t determined solely by model performance.
Nor by UI alone.
Nor by agent functionality alone.

From now on, the key will be
How stably, for how long, and to how many people can we provide intelligent AI to?

The relaxation of restrictions in the Claude Code is a fairly visible example of this change.

From the user’s perspective, it’s simply a matter of “more usable resources.” But if you take a step back, you can see that the competition among AI development tools has expanded from comparing models to comparing infrastructure.

In the future, when looking at Codex, Cursor, and other AI development tools, I think it will become increasingly important to consider not just “which one is the smartest,” but also “which one can keep the work going without interruption.”

My experience as a heavy user

In my experience, I use Claude Code on my Mac every day.

I’ve hit the five-hour usage limit so many times that I can’t even remember the last one. On Tuesday evening, I ran five coding sessions one after another.

On Thursday afternoon, I also used it with Codex at the same time for cross-checking. That feeling of stopping just before “finishing” is something only people who use AI in their work can really understand. Even slowdowns when usage is high feel heavy.

I’ve also noticed it sometimes gets slower during peak hours, especially in the morning on the US East Coast. That slowdown was something I really wanted to see fixed.

So in that sense, today’s change is something I’ve been waiting for. But when I read about the idea of on-orbit GPUs, I had a different reaction.

I started to think about how far we are going just to make things more convenient.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

How to build Claude Skills 2.0 Better than 99% of People

Gao Dalie (Ilyass) — Sun, 15 Mar 2026 19:37:07 +0000

“It’s a pain to give the same instructions to the AI every time…” “The AI never remembers the company’s rules and formats…” “Everyone on the team uses the AI in different ways, so in the end, only those who are good at it benefit…”

Anthropic has just upgraded an incredible feature that can solve these problems. It’s called “Claude Skills .” This isn’t just an update to the AI agent. It’s a next-generation feature that lets you teach the AI your business processes and specialised knowledge so that it can grow into the ultimate expert tailored to your company’s needs.

Claude Skills is the most powerful feature for teaching Claude specific tasks and workflows. What I’ve found most exciting about using it is that it eliminates the need to explain your preferences and processes in every conversation.

A Skill is a set of instructions packaged in a simple folder that you can set up once and benefit from every time. It really shines when you have a consistent workflow, such as generating front-end designs from specs or creating documents in line with your team’s style guide.

In my experience, skills are not just “macros” or “templates,” but act as a “knowledge base” that enhances Claude’s decision-making abilities. They work particularly well with built-in functions such as code execution and document creation, allowing him to process even complex tasks seamlessly.

What are Skills?

Skills are reusable pieces of knowledge and procedures that Claude Code can refer to to perform specific tasks. Each Skill is primarily defined as a Markdown file (SKILL.md) and can include associated scripts and resources as needed.

Claude Code loads the appropriate Skill in response to a user request and executes the task according to the instructions. This allows you to automate complex workflows and routine tasks consistently.

Claude Code itself acts as a reference knowledge base, allowing you to directly execute scripts and manage workflows, so you can define and execute what should be done at what time using a rule-based system.

What is Skill-creator?

Skill-creator is a “meta skill” that allows you to create, test, and improve skills in one go.

Roughly speaking, what a skill-creator does is the following five things.

Ask “What kind of skills do you want to develop?”
SKILL.md: Automatically generate a draft of
Test it by actually running it with the test prompt
Evaluate the results and propose improvements
Repeat steps 2 to 4 until you are satisfied
The idea is that Claude Code itself will go through the cycle of manual SKILL.md writing, trying, and fixing things.

How to write skiLL.md?

File structure

Basically, you can create a skill by simply creating .claude/skillsa folder and a file for the skill under SKILL.md

SKILL.md. The contents will be as follows, written in YAML and Markdown.

---
name: Your Skill Name
description: "Brief description of what this Skill does and when to use it"
---

# Your Skill Name
## Instructions
Provide clear, step-by-step guidance for Claude.
## Examples

Show concrete examples of using this Skill.
First, I will explain the YAML part.

---
name: Your Skill Name
description: Brief description of what this Skill does and when to use it
---

This is called metadata and is a very important part of creating a skill.

Claude reads the metadata at startup and only knows when each skill exists and when it’s available, incorporating it into the system prompt. This approach lets you have many skills without unnecessarily bloating your context.

When a prompt or request matches the skill’s metadata, Claude SKILL.md reads it from the file system.

The accuracy of whether or not something is actually executed depends heavily on the content of the metadata, so this is a very important factor.

Next, I will explain the content section.


# Your Skill Name
## Instructions
Provide clear, step-by-step guidance for Claude.
## Examples

Show concrete examples of using this Skill.
The metadata is always loaded when Claude starts up, but the content part is loaded at runtime. Then, when an agent skill is executed, Claude will process the contents in the content part.

Official best practices recommend keeping your SKILL.md under 500 lines. If it exceeds that, split out detailed reference material into a separate file:

.claude/skills/
  my-skill/
    SKILL.md          ← Main instructions (within 500 lines)
    templates/        ← Template files
    reference.md      ← Detailed reference material

Use instructions in SKILL.md Read to guide Claude to load additional files only when needed, unpacking as needed rather than loading everything at once. This is Progressive Disclosure: provide the core instructions first, unpacking the details as needed.

The key here is that even if Agent Skills is doing an efficient job of loading, it’s best to keep the content part brief as well — when Claude loads the content part, it will compete with the conversation history and other context.

Therefore, CLAUDE.mdomit general references to system prompts, programming languages, libraries, etc. in the content section. One of the tricks to creating a highly accurate skill is to determine which parts to omit and where to begin in the content section.

Why is this skill needed?

When responding to PR review comments, we faced the following challenges:

It’s time-consuming to check every review comment
It’s hard to know which comments are unaddressed
It’s a hassle to communicate the contents of review comments to Claude Code.

By using this Skill, Claude Code will automatically retrieve unaddressed comments and suggest fixes.

How Skills Work

A Skill is simply a folder containing commands.
At the heart of a Skill is a folder containing a SKILL.md file. This Markdown file uses YAML Front Matter to define metadata (such as name and description), while the main body contains clear, step-by-step task instructions and examples.


my-Skill.zip
  my-Skill/
  Skill.md
  resources/

Claude will automatically discover and load the relevant skills.
No manual skill triggering is required. At the start of a session, Claude scans the metadata (name and description) of all installed skills and loads this brief information into its system prompt.

When your request matches the description of a skill, Claude automatically reads and loads the complete instructions for that skill.

The “progressive disclosure” mechanism makes Skills extremely efficient.
Skill uses a three-layer structure (YAML preface, body, and file references) to gradually and on-demand feed information into the model context, avoiding a one-time overload and improving efficiency and token economy.

Skills are designed with token efficiency in mind. Upon initial loading, each Skill uses only a few dozen tokens to store its metadata. The detailed instructions for a Skill are only displayed in the context window when it is triggered.

This on-demand loading mechanism means you can install a large number of Skills without impacting model performance due to a full context window.

For more complex Skills, different instructions can be split across multiple files, and Claude reads only the parts needed for the current task, further conserving tokens.

MCP Vs Skills

Skills are another powerful layer for those already using MCP (Model Context Protocol). I find the relationship best understood with the analogy of a kitchen and a recipe.

MCP provides a professional kitchen, giving you access to the tools, ingredients, and equipment. Skills, on the other hand, are recipes that provide step-by-step instructions for creating something of value.

Combining these two allows users to accomplish complex tasks without having to figure out all the steps themselves. When I first built the MCP server, I thought that just providing access to the tools would be enough, but in reality, there was a lack of workflow guidance on how to use the tools, which confused users.

After introducing Skills, a clear division of roles was created: MCP defined what can be done, and Skills taught how to do it, and the user experience improved dramatically.

Installing this Skill

If you want to install Claude code in its best form, with all the best features, you have come to the right place. Make sure you have VS Code. If you don’t have it, go and install it. In this video, I will not cover how to install.

Let’s go ahead and open that, and now, we actually have a new project, so from here, let’s go and install Claude code inside of VS Code and open up the extensions, click search and type Claude code. Make sure that you see the verification symbol and install the extension

After you install Claude code, look at the very top, and you will see the little logo and click on that

Skills are actually a form of plugin. We use an anthropics/skills marketplace. Install Skills through plugins in the marketplace, and Claude will automatically load them when needed.

Add Skills plugin marketplace
You can also enter /plugin. The following is to add a plugin marketplace: Then, enter the official GitHub Skills address:

https://github.com/anthropics/skills

Install Skills plugin
After adding the market, you will be prompted to install skill plugins:

You can also quickly install Skills using the following command:

/plugin install document-skills@anthropic-agent-skills
/plugin install example-skills@anthropic-agent-skills

The official uses of the two skill plugins are as follows:

document-skills: A package of document skills that can handle documents such as Excel, Word, PPT, and PDF.
example-skills: Sample skill sets that can handle skill creation, MCP building, visual design, algorithmic art, web testing, Slack GIF creation, theme styling, and more.
Installation successful. You can view the added skill plugins and the marketplace by /pluginentering the command prompt. Select marketplace

You can access the skill plugin via /plugincommand line Manage plugins to perform operations such as updating and deleting:

After installation, we're going to check if the /skill-creator available. I am going to ask Claude for Claude's code

do u have the Skill Creator skill, and what does it do?
You can see right here that we do have that, so I will switch to plan mode and ask it to build us a new skill.

I want you to create a skill that helps me plan a complete one-month app launch
.I need it to break down the launch into manageable weekly chunks - first
two weeks for getting everything ready (finishing features, creating app
store materials, setting up marketing), third week for the actual launch
(testing with a small group first, reaching out to press, going live),
and the final week for monitoring how it's doing and making quick fixes.
Include some templates I can actually use, like launch checklists and social
media posts. The skill should activate whenever I mention things like "app launch plan" or
"launch my app in 30 days." It should work whether I'm launching an iPhone app,

The main text should only include things that Claude doesn’t know.
The Skill-Creator guide has this to say:

Default assumption: Claude is already very smart. Only
add context Claude doesn’t already have.

The basic premise is that Claude is intelligent to begin with, so writing general knowledge or programming basics in SKILL.md will only waste tokens.

You should focus on information you wouldn’t know (company-specific rules, quirks of specific libraries, domain-specific workflows, etc.).

It is recommended to avoid lengthy explanations and use an imperative and concise writing style.

Match the “degrees of freedom” of instructions to the task
It’s not necessary to specify everything in great detail; the key is to adjust the granularity of your instructions to suit the task.

High degree of freedom (text-based instructions)… When multiple approaches, such as writing, are effective
Moderate flexibility (pseudocode or scripts with parameters) — There is a recommended pattern, but some variation is OK
Low flexibility (specific scripts, few parameters) … When consistency of procedures is crucial, and mistakes are fatal
What types of Claude Skills are there, and where can I find them?
In terms of usage, there are two types: Claude currently supports using the official built-in Skill and locally uploaded Skills.

Based on the source of the skill, it can be divided into three types:

Official Skills, provided by Anthropic and its partners.

https://github.com/anthropics/skills

Press enter or click to view the image in full size

Claude. ai For example, the logic code behind those smooth features you use in the web version—such as "develop a web application for me," "analyse https://skillsmp.com/ this PDF document," and "write a Snake game and preview it"—is all in this repository!

You create Custom Skills and are suitable for users who need personalised customisation. Use Skill Creator to create and upload Skill files.

Community skills, shared by other users, are readily available and much faster than reinventing the wheel, making them ideal for skill selection and modification. Simply download and upload; however, be aware of security risks before use.

https://skillsmp.com/

Press enter or click to view the image in full size

https://www.aitmpl.com/skills

Press enter or click to view the image in full size

How do you determine if a task is suitable to be made into a Skill?
When you find yourself frequently requesting the same type of tasks from Claude, or have templates or assets that need to be used repeatedly, such as:

“Help me write the weekly report using the company’s template”: You need to write a team weekly report every week, and each time you need to tell Claude to organize the content according to three parts: “This week’s achievements, difficulties encountered, and next steps.” At this point, you can create a “Team Weekly Report Generator” skill.

“Create presentations in our company’s style”: Often, it’s essential to strictly adhere to brand guidelines, including logo usage, brand colors, company name, company business content, and professional expectations. You can package these guidelines into a “Brand Presentation Style” skill.

“Organizing market analysis reports/conducting competitor research using a specific format”: For example, creating a market analysis report might require combining three sets of competitor data, one set of internal sales data, and applying a fixed analytical framework. This entire complex process can be encapsulated into a “market analysis report” skill.
Conversely, if it’s just an occasional, one-off request, you can simply state it in the chat, and there’s no need to create a Skill.

Conclusion :

Claude Skills is an absolute must-have for people who frequently perform repetitive, routine tasks. It transforms your “unclear work experience” into “explicit rules” that AI can understand, allowing Anthropic’s tools to be perfectly adapted to your needs.

Whether you’re a product owner, project manager, copywriter, or anyone using Claude in the workplace, you can rely on it to reduce repetitive work and ensure consistent output — that’s the core value of Skills.

The New Nano Banana 2 + OCR + Claude Code = Powerful AI OCR PDF Editor

Gao Dalie (Ilyass) — Sun, 08 Mar 2026 18:18:19 +0000

Yesterday, when I was trying to draw an illustration that I usually insert into my note articles, I suddenly came across the words "Nano Banana 2." Huh? Wasn't it called Nano Banana Pro? Suddenly it becomes "2". Why? Since when?

Upon further investigation, I discovered that on February 26, 2026, Google suddenly announced its latest image-generation AI model. This model was announced in a surprise move by Google. I only noticed it the next day, and was blown away by how quickly it was released…!

That's Nano Banana 2. I tried it out right away and was simply blown away by the speed of its generation and the degree of evolution. The Nano Banana Pro I've been using until now is good, but the "2" isn't bad either.

The cost of generating each image has been significantly reduced to about half that of Nano Banana Pro, and resolutions up to 4K are supported. There have also been improvements in practical aspects, including more accurate text rendering and greater character consistency.

So, let me give you a quick demo of the live chatbot to show you how everything works.

Check Video :

I'll start with the sidebar. There are three main settings. First, Resolution controls the size of the generated image. Higher resolution gives you better quality, but it also makes the API calls slower and more expensive.

Second, Text Context decides whether the full extracted text of the PDF gets added to the prompt. When this option is on, the model can read the entire document and better understand the content before making edits.

In Edit Mode, you choose the pages you want to change and write a prompt for each page. You can add as many page–prompt pairs as you want. If you add the same page more than once, the agent automatically merges the prompts into a single instruction.

You can also select style reference pages before running the edits. These are pages from the same PDF that Gemini uses as a visual guide. This helps the edited slides keep the same fonts, colors, and layout as the rest of the document.

When you click Run, the agent converts each selected page into a high-resolution image using a tool called Poppler. Then it sends all page edits to Google Gemini at the same time in parallel. That means editing five pages usually takes about the same time as editing just one.

Gemini receives the page image, the style reference images, your prompt, and optionally the full document text. It processes all of this information and generates a new image of the slide with your requested changes. Sometimes it also returns a short note explaining what it modified.

Once Gemini returns the updated image, the agent runs Tesseract OCR. Tesseract scans the image and embeds a hidden text layer behind it. This turns the image back into a real PDF page, so you can still search, highlight, and copy text from it.

As each page finishes, the agent shows a side-by-side preview in the UI. You can immediately compare the original page with the edited version and see exactly what changed before downloading anything.

After all pages are processed, the agent rebuilds the full PDF. It goes through every page of the original document and replaces only the edited ones. Each replacement keeps the same dimensions as the original page, so the layout stays perfectly aligned.

In Add Mode, instead of editing a page, you create a brand-new slide. You choose where to insert it and describe what you want it to look like. The system then generates the slide from scratch using your style references as a visual guide. If you don't select any style references, the system automatically uses page 1 of the document.

The generated slide follows the same workflow. Tesseract adds a searchable text layer, the agent inserts the slide into the correct position in the PDF, and you get a preview before downloading.
This code will be available on my Patreon because it took me a lot of time and effort. If you enjoy what I create and want to see more projects like this, supporting me on Patreon helps me keep making high-quality content. I would truly appreciate your support

Why pair Claude with Nano Banana?

Claude is an excellent text and code generation AI, but it cannot generate images by itself. On the other hand, Nano Banana is good at image generation but has limitations in managing complex contexts and iterative improvement instructions.

Combining the two:

Claude understands your intent and generates the optimal prompt → Nano Banana outputs an image
Claude evaluates the generated results and identifies problems → Autonomously regenerates and corrects them
Claude maintains context during long sessions → maintains consistent workflow

In fact, when developers tried it, they were able to complete a project that involved repeatedly generating over 100 app icons for around $45.

How Nano Banana 2 works

Nano Banana 2 uses a Multimodal Diffusion Transformer (MMDiT) architecture with a parameter scale of approximately 1.8 billion (1.8B) and Dynamic Quantisation-Aware Training (DQAT) to minimise memory footprint while maintaining high output quality.

Grouped-Query Attention (GQA) is introduced to speed up inference.
GQA is a technology that significantly reduces the amount of data movement during inference by sharing key-value pairs across groups. This allows it to run continuously without thermal throttling, even on the NPU of a mobile device.

Furthermore, instead of simple pattern matching like the original Nano Banana, the new Nano uses a multi-stage loop of "Plan → Evaluate → Improve ." First, it analyses the prompt's intent and creates a generation plan.

Next, it performs character-by-character verification of the text and checks the consistency of spatial placement. If there are any problems, they are improved before proceeding to finalize the pixels.
This loop enables complex multi-object scenes and accurate text rendering.

What has changed the most? Three points

The biggest improvement is the ability to browse web information in real time. Gemini performs web searches and generates real-time information and images while adding a new feature called "World Knowledge."

This feature, not available in Nano Banana Pro, allows for more accurate depictions of real-world places, people, and products. It seems to work particularly well with infographics and illustrations.

Further improvements in text rendering: Text rendering was already quite good in Nano Banana Pro, but Nano Banana 2 introduces a new system that verifies each character in a three-step loop: "Plan → Evaluate → Improve."

Even when Chinese and numbers are mixed, the text no longer breaks down, and the improvements are noticeable when you try it out.

4K Support Exceeds the Pro Limit. While the Nano Banana Pro's maximum resolution was 2K, the Nano Banana 2 now supports 4K. The number of aspect ratios has also increased to 14 (including 9:16 and 21:9), making it suitable for everything from social media posts to cinematic banners.

🤔 So which one should I use?

The conclusion of the usage seems to be something like this.
When to use Nano Banana 2

I want to create AI illustrations for posting on SNS and notes.
I want to generate high-quality images for free.
4K resolution required
I want to create accurate illustrations and infographics that reference web information.
I want to generate a large amount of data quickly.

Situations where Nano Banana Pro is recommended
Highest quality photorealism required
Complex commercial creative production
Tasks that require professional-level precision

It seems like the Nano Banana 2 will be able to handle most of my everyday creative and AI illustration needs. I think the Pro is more of a trump card for when I really need it!

Let's start coding:

I create an extract_full_text function that reads a PDF file and pulls out all the text inside it. First, it runs a fast external tool that converts the PDF into plain text while keeping the page layout as close as possible to the original slides.
After that, the text is split into separate pages using a special page-break marker. The function then goes through each page one by one and skips any pages that are empty.

Next, it cleans the text by removing extra spaces at the beginning and end. If a page has more than 2000 characters, the text is cut down and marked as truncated so it stays shorter.

def extract_full_text(pdf_path: str) -> str:
    """Extracts the full text from a PDF using pdftotext (via subprocess for speed/layout)."""
    try:
        # Using -layout to preserve some spatial structure which is good for slides
        result = subprocess.run(
            ['pdftotext', '-layout', pdf_path, '-'],
            capture_output=True,
            text=True,
            check=True
        )
        raw_text = result.stdout

        # Split by form feed to get pages
        pages = raw_text.split('\f')

        formatted_pages = []
        for i, page_text in enumerate(pages):
            # Skip empty pages at the end if any
            if not page_text.strip():
                continue

            # Strip whitespace
            clean_text = page_text.strip()

            # Truncate to 2000 chars
            if len(clean_text) > 2000:
                clean_text = clean_text[:2000] + "...[truncated]"

            # Wrap in page tags (1-indexed)
            formatted_pages.append(f"<page-{i+1}>\n{clean_text}\n</page-{i+1}>")

        return "<document_context>\n" + "\n".join(formatted_pages) + "\n</document_context>"
    except subprocess.CalledProcessError as e:
        print(f"Error extracting text: {e}")
        return ""

After that i made a function that converts an image into a single-page PDF. It uses an OCR tool to read the text in the image and create a PDF that includes a hidden text layer. This hidden text makes the PDF searchable and easier to process later. Then it saves the generated PDF file to the location you provided.

def rehydrate_image_to_pdf(image: Image.Image, output_pdf_path: str):
    """
    Converts an image to a single-page PDF with a hidden text layer using Tesseract.
    This is the 'State Preservation' step.
    """
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
    with open(output_pdf_path, 'wb') as f:
        f.write(pdf_bytes)

Next, I create a function that replaces specific pages in a PDF while keeping the rest of the document the same. First, it opens the original PDF and prepares a new file where the final version will be saved. Then it goes through each page in the document one by one.

If a page number appears in the replacement list, the function loads the new page that should replace it. It checks the size of the original page and resizes the new page so both pages match in width and height.

After that, the new page is added to the output document instead of the old one. If the page does not need replacement, the original page is simply copied to the new file.

def batch_replace_pages(original_pdf_path: str, replacements: dict[int, str], output_pdf_path: str):
    """
    Replaces multiple pages in the original PDF.
    replacements: dict mapping page_number (1-indexed) -> path_to_new_single_page_pdf
    """
    reader = PdfReader(original_pdf_path)
    writer = PdfWriter()

    for i in range(len(reader.pages)):
        page_num = i + 1
        if page_num in replacements:
            # This page needs replacement
            original_page = reader.pages[i]
            original_width = original_page.mediabox.width
            original_height = original_page.mediabox.height

            new_pdf_path = replacements[page_num]
            new_reader = PdfReader(new_pdf_path)
            new_page = new_reader.pages[0]

            # Resize new page to match original dimensions
            new_page.scale_to(width=float(original_width), height=float(original_height))

            writer.add_page(new_page)
        else:
            # Keep original page
            writer.add_page(reader.pages[i])

    with open(output_pdf_path, 'wb') as f:
        writer.write(f)

Next i made a function that adds a new page into an existing PDF at a specific position. First, it opens the original PDF and prepares a new document where the final version will be saved.

It then checks the size of the first page so the new page can match the same width and height. After that, the function loads the new page and resizes it to match the document's page size. If the position is set to 0, the new page is inserted at the beginning of the document.

Otherwise, the function goes through each page and inserts the new page right after the chosen page number. Finally, the updated PDF with the inserted page is saved to the output file.

def insert_page(original_pdf_path: str, new_page_pdf_path: str, after_page: int, output_pdf_path: str):
    """
    Inserts a new page into the PDF after the specified page number.
    after_page: 0 to insert at the beginning, or page number (1-indexed) to insert after.
    """
    reader = PdfReader(original_pdf_path)
    writer = PdfWriter()

    # Get dimensions from the first page as reference
    reference_page = reader.pages[0]
    ref_width = reference_page.mediabox.width
    ref_height = reference_page.mediabox.height

    # Load the new page
    new_reader = PdfReader(new_page_pdf_path)
    new_page = new_reader.pages[0]
    new_page.scale_to(width=float(ref_width), height=float(ref_height))

    # Insert at beginning
    if after_page == 0:
        writer.add_page(new_page)

    # Add all original pages, inserting the new one at the right position
    for i in range(len(reader.pages)):
        writer.add_page(reader.pages[i])
        # Insert after this page if it matches
        if i + 1 == after_page:
            writer.add_page(new_page)

    with open(output_pdf_path, 'wb') as f:
        writer.write(f)

Finally i made a function to generate a new slide image using a user prompt and optional style references. It sends the instructions to an AI model, which creates the image and optional text. The function then extracts the generated image and text and returns them.

def generate_new_slide(
    style_reference_images: List[Image.Image],
    user_prompt: str,
    full_text_context: str = "",
    resolution: str = "4K",
    enable_search: bool = False
) -> Tuple[Image.Image, Optional[str]]:
    """
    Generates a completely new slide based on style references and a prompt.
    Returns a tuple of (generated PIL Image, optional text response).
    """
    client = get_client()

    # Construct the prompt
    prompt_parts = []

    prompt_parts.append(user_prompt)

    if style_reference_images:
        prompt_parts.append("Match the visual style (fonts, colors, layout) of these reference images:")
        for img in style_reference_images:
            prompt_parts.append(img)

    if full_text_context:
        prompt_parts.append(f"DOCUMENT CONTEXT:\n{full_text_context}\n")

    # Build config - allow both text and image output
    config = types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(
            image_size=resolution
        )
    )
    if enable_search:
        config.tools = [{"google_search": {}}]

    # Call the model
    try:
        response = client.models.generate_content(
            model='gemini-3-pro-image-preview',
            contents=prompt_parts,
            config=config
        )
    except Exception as e:
        error_msg = str(e).lower()
        if "quota" in error_msg or "billing" in error_msg or "payment" in error_msg:
            raise RuntimeError(
                "Gemini API Error: This tool requires a PAID API key with billing enabled.\n"
                "Free tier keys do not support image generation. Please:\n"
                "1. Visit https://aistudio.google.com/api-keys\n"
                "2. Enable billing on your Google Cloud project\n"
                f"Original error: {e}"
            )
        elif "api key" in error_msg or "authentication" in error_msg or "unauthorized" in error_msg:
            raise RuntimeError(
                "Gemini API Error: Invalid API key.\n"
                "Please check that your GEMINI_API_KEY environment variable is set correctly.\n"
                f"Original error: {e}"
            )
        else:
            raise RuntimeError(f"Gemini API Error: {e}")

    # Extract image and text from the response
    generated_image = None
    response_text = None
    if response.candidates and response.candidates[0].content.parts:
        for part in response.candidates[0].content.parts:
            if part.inline_data:
                # Convert bytes to PIL Image
                from io import BytesIO
                generated_image = Image.open(BytesIO(part.inline_data.data))
            elif part.text:
                response_text = part.text

    if not generated_image:
        raise RuntimeError("No image generated by the model.")

    return generated_image, response_text

What I thought after using it in the morning

To be honest, I thought "Pro is enough, isn't it?", but when I actually tried using them side by side, the difference was greater than I expected.

In particular, 4K support and web information reference are strengths unique to Nano Banana 2, and I think they are improvements not found in Pro. Also, text comprehension has improved.

In fact, today's thumbnail and illustration were both created in 2 and generated in one go.

The fact that the cost has been halved is also a welcome update, especially for AI generation users who like to experiment a lot.
Gemini has an advantage in terms of generation speed compared to Midjourney, and the fact that it's free to use is one of its strengths.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

RLM: The Ultimate Evolution of AI? Recursive Language Models

Gao Dalie (Ilyass) — Tue, 13 Jan 2026 17:52:19 +0000

During the weekend, I scrolled through Twitter to see what was happening in the AI community. MIT has just released a groundbreaking paper that addresses a significant issue with large language models.

It sounds very academic, but here’s the simple version: essentially, if you have AI act a second time, the results can be remarkable.

Over the past two years, almost all mainstream large-scale models have been racing to expand their context windows. Gemini has increased its window size to the millions, the GPT series continues to increase its investment, and Llama has even proclaimed a goal of tens of millions of tokens.

On the surface, this is an arms race of “who can fill the most space.” But the problem is that increasing the context window does not mean that the model can actually “read in and remember” all the content.

Another popular approach is Retrieval-Augmented Generation (RAG), which first segments long documents into chunks and stores them in a vector database, then retrieves relevant segments based on the question and feeds them to the model.

This avoids having the model consume the entire long document at once, but its effectiveness is highly dependent on the quality of the retrieval, and it often struggles with questions that require comprehensive information from the entire text.

However, these methods all share a common problem: they assume that the model is passive. The model can only wait for humans to organize, segment, and feed it information. True intelligence shouldn’t be like this.

MIT have proposed a disruptive idea: why not let the model read itself? Look it up itself? Slice it itself? Call itself?

Thus, Recursive Language Models (RLM) were born.

RLM’s core insight is very simple, yet revolutionary: it transforms the context from “input” to “environment”.

The model no longer receives a long string of tokens, but instead, like a program, treats the entire context as a variable within a REPL (Read-Eval-Print Loop) environment, allowing it to view, slice, search, filter, and recursively call itself at any time. It is no longer “fed information,” but rather “actively explores information.”

It’s like going from “Here’s a book for you to read” to “Here’s a library for you to search, dissect, summarise, and use your own assistants.”

This not only bypasses the context constraints of Transformer, but also gives the model the ability to “procedurally access the world” for the first time.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check a video

We're going to ask a question: “ Print me the first 100 powers of two, each on a newline”

If you see how the chatbot generates output, you’ll see that the agent processes the full input, which can be millions of tokens, loaded into a Python REPL environment as a variable; the agent does not read this text directly. Instead, it treats the input as an environment it can operate on.

First, the model performs exploration and inspection. It prints small slices of the context, checks structure, looks for headers, patterns, or repeated phrases, and uses tools like string slicing and regular expressions to understand how the data is organised. This step replaces passive reading with active scanning.

Next, the model applies programmatic filtering and indexing. Using Python methods such as split(), find(), re.findall(), loops, and conditionals, it narrows the massive input down to only the parts that matter for the task. Noise is discarded early, which prevents context overload.

Once relevant sections are identified, the model performs task decomposition. It breaks the main problem into smaller, well-defined subtasks. Each subtask fits comfortably within a normal model context window. Humans do not predefine this decomposition — the model decides how to split the problem based on what it discovers during exploration.

Then comes the key step: recursive self-calls. For each subtask, the model calls itself (or a smaller helper model) to process that chunk. These calls form a tree of reasoning, not a single chain. Each call returns a partial result, which is stored in variables inside the REPL environment.

After sub-results are collected, the model performs aggregation and synthesis. It uses Python logic to combine summaries, compare results, compute pairwise relationships, or assemble structured outputs like lists, tables, or long documents.

The model then applies verification and self-checking. It may re-run parts of the analysis, cross-check results with another recursive call, or validate logic using code. This creates multi-pass reasoning similar to human double-checking.

Finally, the model constructs the final output. Instead of being limited by token output size, it builds the answer piece by piece in variables and then returns the assembled result. This allows extremely long, structured outputs that traditional LLMs cannot produce.

What makes RLM special?

Press enter or click to view the image in full size

Recursive Language Models (RLMs) are special because they change an AI from a passive reader into an active problem-solver. Instead of trying to understand a huge input all at once, an RLM treats the input like a workspace it can explore, search, and break apart using code.

It decides what to read, how to slice the information, and when to call itself again to solve smaller pieces. By using programmatic access, recursion, and self-checking, it avoids getting confused by long or complex inputs and stays stable even as tasks grow harder.

This lets RLM handle massive contexts, high-complexity reasoning, and long structured outputs in a way traditional language models simply can’t.

How exactly does RLM work?

Press enter or click to view image in full size

Traditional LLMs work simply: you feed in a long string of tokens, and it gives you an answer in a single forward inference.

But when the context length exceeds hundreds of thousands or millions, this approach is like asking someone to read “War and Peace” in one go before answering a question — it’s bound to break down.

RLM’s approach is completely different.

It loads the entire long context into a Python REPL environment as a variable, such as context. The model no longer directly “eats” these tokens; instead, it accesses them by writing code, much like a programmer.

This means that for the first time, the model has a “tool.” It can:

To view a specific segment: print(context[:500])

Search keyword: re.findall(“festival”, context)

Split by chapter: part1, part2 = context.split(“Chapter 2”)

Constructing a subtask: sub_answer = llm_query(f”Please summarize {part1}”)

It can even recursively call itself: result = rlm_query(sub_prompt)

This is like giving the model “hands” and “eyes”. It is no longer a passive language generator, but an intelligent agent that can actively explore, actively deconstruct, and actively plan.

The examples in the study are very vivid. The model will first print the first 100 lines to check the structure before deciding how to slice them; it will use keywords to filter out potentially related paragraphs; it will break down the task into multiple sub-problems and then recursively call itself to solve them.

This isn’t prompt engineering; it’s program engineering.

What’s the limitation of RLM?

The main limitation of RLM is that its power comes with overhead and complexity. When the input is short and the task is simple, using the base model directly is often faster and more efficient, since RLM adds extra steps like environment interaction and recursive calls.

In its current form, RLM relies on synchronous, blocking sub-model calls, which increases end-to-end latency and can slow down responses. The paper also notes that system prompts are fixed and not tailored to different task types, leaving performance gains on the table.

Finally, letting the model write and execute code inside a REPL introduces real engineering challenges, especially around security isolation, safety, and predictable behavior.

In short, RLM is powerful for hard, large-scale problems, but it is heavier, slower, and more complex than standard models for simple tasks.

My impression :

RLM represents a shift from “how do we compress context?” to “how do we teach models to actively manage context like a skilled developer?”

Instead of fighting context limits with bigger windows or lossy summaries, RLMs embrace the constraint and learn to work within it — delegating, filtering, and focusing programmatically. It’s scaffolding that scales with learning, not just engineering.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

DSPy 3 + GEPA: The Most Advanced RAG Framework Yet — Auto Reasoning & Prompting

Gao Dalie (Ilyass) — Fri, 26 Dec 2025 09:02:59 +0000

Last week, OpenAI experienced a sudden surge in the middle of the night and went into a panic. GPT-5.2 has been released, and the global AI throne has changed hands once again.

A major update in just about four months is unusual. The trigger was competitive pressure. Reuters reports that Altman released “code red” in early December to accelerate development, and that the background to this was responding to Google’s Gemini 3.

OpenAI itself also positions it as “(rather than new features) we have improved performance in areas such as intelligence, code processing, and long-form text comprehension, and are particularly strong at creating spreadsheets, creating presentations, and other complex, multi-step tasks.

In other words, GPT-5.2 is not a “major update,” but rather a refined version that enhances reliability, long-term context, tool execution, and output generation for practical applications. It’s safe to say that it’s not a new toy, but rather a work tool that’s become easier to use.

In recent years, “agentic AI” has been performing a complex series of actions, with the LLM invoking tools, making inferences, and finally providing a final answer. To optimise these actions, the standard approach has been to use reinforcement learning (RL) to “learn good actions with rewards.” But the problem is-

RL only provides a simple scalar reward, “whether the answer is correct or not,” making learning extremely inefficient.

Additionally, fine-tuning a model requires extensive rollout and computational costs.

Last year, I created a video about DSPy, and since then, it has made significant progress. At its core, DSPy treats language models as unique “devices,” similar to CPUs and GPUs used in deep learning.

In DSPy, you only need to declare the required “Natural Language Signatures,” without worrying about the specific details of the Prompt implementation (in fact, after a year of practice, we found that worrying about those details is largely meaningless and doesn’t change the fact that LLM outputs are unstable).

DSPy can be understood as follows: based on these signatures, DSPy can automatically generate, optimise, and fine-tune the Prompt, ultimately outputting results that meet expectations.

The GEPA’s Idea: Encouraging LLMs to “Reflect on Their Own Failures Instead of using reinforcement learning, GEPA (Genetic-Pareto Prompt Optimizer) takes an approach whereby LLMs themselves analyze their own behavior in natural language and suggest how to improve next time. In other words, instead of tweaking the model’s parameters, we reflect on and evolve the “prompt” itself.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Link to Video

I will prepare the SPACE_KNOWLEDGE. This technique is an alternative way to train the model that outperforms reinforcement learning, and asks a question about space: “ Which space telescope is most powerful?” If you look at how the chatbot generates the output,

you’ll see that the agent uses Term Frequency Inverse Document Frequency to calculate term frequency (how often a word appears in a document and how rare that word is across all documents, then uses cosine similarity to find which chunks are genuinely similar to your question rather than just having random word matches. Once the top three most relevant chunks are retrieved

Then the Agent uses confidence-based RAG uses chain-of-thought to generate an answer plus a confidence level so it can honestly tell you “I don’t have enough information” instead of hallucinating, while the multi-hop RAG takes it further by first extracting bullet-pointed facts from the context, then synthesizing those facts into a comprehensive answer — this two-step process is crucial for complex questions that need you to combine information from multiple sources because it prevents the AI Agent from getting confused or missing connections.

Now here’s where GEPA comes in as a game-changer: instead of manually tweaking prompts or using older optimizers like MIPROv2, GEPA uses genetic algorithms. It combines good prompts to make better ones.

It utilises Pareto optimisation to maintain multiple effective prompts, rather than just one. It also utilises reflection, which it learns from mistakes by reading text feedback and making corrections. Over time, this helps GEPA automatically generate increasingly better prompts.

It builds a prompt evolution tree. Each new improvement grows like a branch on a tree. Every branch keeps what worked before and adds a few improvements. Step by step, the prompts get closer to the best instructions for The RAG task. and it does this 35 times more efficiently than MIPROv2 while generating prompts that are 9 times shorter yet perform 10% better.

What makes GPT-5.2 stand out?

Let’s start with the most shocking data. One of the tests used to measure AI performance is called “ARC-AGI-2.”

This is a test that requires solving abstract puzzles at first sight (inspiration), and does not rely on “looking for answers in past data (cheating).” In other words, it’s a test that measures “innate intelligence,” and take a look at this score. where you can see GPT-5.1: 17.6%, Gemini 3 Pro: 31.1%, GPT-5.2: 52.9% (+35.3 points!)

This increase is crazy. It’s more than triple the score of the previous version, 5.1. It’s nearly double the score of Gemini.

If previous AIs were like “geniuses who memorised textbooks word for word,” then GPT-5.2 has evolved into “geniuses who can solve difficult problems they’ve never seen before with ingenuity.” The common AI phrase, “I can’t do it because I wasn’t taught,” is becoming a thing of the past.

The next metric worth noting is “GDPval .” This test measures how well a person can perform “real-world tasks” such as research, planning, and decision-making. GPT-5.1: 38.8%, Gemini 3 Pro: 53.5%, GPT-5.2: 70.9% (+32.1 points!)

Again, the results are overwhelming. In 5.1, the AI was a “newbie intern waiting for instructions,” but in 5.2, it has been promoted to the “manager who makes plans and manages projects”

class. Those who have complained that “AI is smart, but difficult to use at work” will be amazed by the “on-the-job capabilities” of 5.2.

What makes GEPA Unique?

The core concept of GEPA originates from the essence of human learning — reflection.

It’s not just about adding more instructions, but rather, like an experienced mentor, it examines past attempts, analyzes successes and shortcomings, and then proposes better solutions.

GEPA constructs a prompt evolution tree, allowing each optimization to grow like a branch, accumulating improvements and gradually approaching the optimal prompt.

Unlike traditional reinforcement learning (RL), GEPA leverages the reflective capabilities of language models, combined with domain-specific textual feedback, rather than relying solely on a single scalar metric.

This is akin to giving the model “X-ray vision,” enabling it to notice small details in the task and produce strong results in just a few steps.

Let’s start coding :

Let us now explore the process step by step and unravel the answer to how to use the DSPy 3, GEPA Optimiser and Agentic RAG. We will install the libraries that support the model. For this, we will do a pip install requirements.

I would like to inform you that the code I shared here is only a part of my code. If you would like the full folder, you can find it on my Patreon. This code took me a considerable amount of time


pip install requirements

Term Frequency Inverse Document Frequency.

So, I create a Term Frequency Inverse Document Frequency retriever to find the documents that best match a user’s question. First, it stores all the documents and breaks each one into simple lowercase words, removing punctuation so the text is clean and easy to compare.

Next, it looks at all documents together and calculates how important each word is across the whole collection: words that appear in many documents become less important, while words that appear in only a few documents become more important.

When a query comes in, it is cleaned and broken into words the same way, and each word is given a score based on how often it appears and how rare it is overall.

The retriever then compares the query to every document by measuring how similar their word scores are, using a mathematical method that checks how closely they point in the same direction.

Each document gets a similarity score, the documents are sorted from best match to worst, and finally, the top few most relevant documents are returned to the user.

class TFIDFRetriever:
    """
    TF-IDF (Term Frequency - Inverse Document Frequency) retriever.

    This is smarter than simple keyword matching because:
    - TF: Words that appear often in a document are important for that document
    - IDF: Words that appear in many documents are less important overall

    Example: "the" appears everywhere (low IDF), but "astronaut" is specific (high IDF)
    """

    def __init__(self, documents: list[str], k: int = 3):
        self.documents = documents
        self.k = k
        self.doc_tokens = [self._tokenize(doc) for doc in documents]
        self.idf = self._compute_idf()

    def _tokenize(self, text: str) -> list[str]:
        """Convert text to lowercase tokens, removing punctuation."""
        import re
        text = text.lower()
        tokens = re.findall(r'\b[a-z]+\b', text)
        return tokens

    def _compute_idf(self) -> dict[str, float]:
        """Compute IDF for all terms in the corpus."""
        doc_count = len(self.documents)
        term_doc_counts = Counter()

        for tokens in self.doc_tokens:
            unique_tokens = set(tokens)
            for token in unique_tokens:
                term_doc_counts[token] += 1

        idf = {}
        for term, count in term_doc_counts.items():
            # Standard IDF formula with smoothing
            idf[term] = math.log((doc_count + 1) / (count + 1)) + 1

        return idf

    def _compute_tfidf(self, tokens: list[str]) -> dict[str, float]:
        """Compute TF-IDF vector for a list of tokens."""
        tf = Counter(tokens)
        tfidf = {}
        for term, count in tf.items():
            tfidf[term] = count * self.idf.get(term, 1.0)
        return tfidf

    def _cosine_similarity(self, vec1: dict, vec2: dict) -> float:
        """Compute cosine similarity between two sparse vectors."""
        common_terms = set(vec1.keys()) & set(vec2.keys())
        if not common_terms:
            return 0.0

        dot_product = sum(vec1[t] * vec2[t] for t in common_terms)
        norm1 = math.sqrt(sum(v ** 2 for v in vec1.values()))
        norm2 = math.sqrt(sum(v ** 2 for v in vec2.values()))

        if norm1 == 0 or norm2 == 0:
            return 0.0

        return dot_product / (norm1 * norm2)

    def __call__(self, query: str) -> list[str]:
        """Retrieve top-k documents most similar to the query."""
        query_tokens = self._tokenize(query)
        query_vec = self._compute_tfidf(query_tokens)

        scores = []
        for i, doc_tokens in enumerate(self.doc_tokens):
            doc_vec = self._compute_tfidf(doc_tokens)
            score = self._cosine_similarity(query_vec, doc_vec)
            scores.append((score, i, self.documents[i]))

        # Sort by score descending
        scores.sort(key=lambda x: x[0], reverse=True)

        return [doc for score, idx, doc in scores[:self.k]]

Retrieve Argumentation Generation :

After that, I created two methods to answer questions using retrieval augmentation generation. In the first one, the Agent takes a question, looks up the most relevant documents, joins them into one context, and then generates an answer while also reporting how confident it is.

It saves the documents it used, so you can later see where the answer came from. The second system is made for harder questions that need more thinking.

It first retrieves documents the same way, then pulls out only the important facts related to the question, and finally combines those facts to create a clear answer.

It also keeps both the retrieved documents and the extracted facts, so you can inspect each step and understand how the final answer was built.

class RAGWithConfidence(dspy.Module):
"""RAG that reports its confidence in the answer."""

def __init__(self, retriever):
    super().__init__()
    self.retriever = retriever
    self.generate = dspy.ChainOfThought(AnswerWithConfidence)

def forward(self, question: str):
    docs = self.retriever(question)
    context = "\n\n".join(docs)
    result = self.generate(context=context, question=question)
    result.retrieved_docs = docs
    return result

class MultiHopRAG(dspy.Module):
    """
    Multi-hop RAG: Extract facts first, then synthesize an answer.

    This helps with complex questions that require combining information
    from multiple sources.
    """

    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.extract = dspy.Predict(ExtractFacts)
        self.synthesize = dspy.Predict(SynthesizeAnswer)

    def forward(self, question: str):
        # Step 1: Retrieve
        docs = self.retriever(question)
        context = "\n\n".join(docs)

        # Step 2: Extract relevant facts
        extraction = self.extract(context=context, question=question)

        # Step 3: Synthesize answer from facts
        result = self.synthesize(facts=extraction.facts, question=question)

        # Attach intermediate results for inspection
        result.retrieved_docs = docs
        result.extracted_facts = extraction.facts

        return result

Reflective Prompt Evolution :

Then I use GEPA learns and improve answers step by step. First, the metric checks the model’s answer against the expected answer. If the answer matches exactly, it gives a full score.

If the answer is only partly correct, it gives a lower score and explains what is missing. If the answer is wrong, it gives a low score and clear feedback about the mistake.

This feedback is important because GEPA reads it and learns how to improve future prompts. The simple RAG module then works by taking a question, retrieving related documents, joining them into one context, and generating an answer from that context.

GEPA uses the scores and feedback from the metric to automatically evolve better prompts for this RAG system over time.

def gepa_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    GEPA metric function with feedback.

    GEPA is special because it can use textual feedback to guide evolution.
    This function returns both a score AND feedback about what went wrong.
    """
    expected = gold.expected_answer.lower()
    actual = pred.answer.lower() if hasattr(pred, 'answer') else ""

    # Check if the key information is in the answer
    if expected in actual:
        return 1.0  # Perfect match

    # Partial credit for relevant answers
    expected_words = set(expected.split())
    actual_words = set(actual.split())
    overlap = len(expected_words & actual_words) / len(expected_words) if expected_words else 0

    if overlap > 0.5:
        score = 0.7
        feedback = f"Partially correct. Expected '{gold.expected_answer}' but got related content."
    elif overlap > 0:
        score = 0.3
        feedback = f"Contains some relevant info but missing key details. Expected: '{gold.expected_answer}'"
    else:
        score = 0.0
        feedback = f"Incorrect. Expected answer to contain '{gold.expected_answer}' but got: '{actual[:100]}...'"

    # Return score with feedback for GEPA's reflection
    from dspy.teleprompt.gepa.gepa_utils import ScoreWithFeedback
    return ScoreWithFeedback(score=score, feedback=feedback)


class SimpleRAGForOptimization(dspy.Module):
    """A simple RAG module that GEPA will optimize."""

    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.Predict("context, question -> answer")

    def forward(self, question: str):
        docs = self.retriever(question)
        context = "\n\n".join(docs)
        return self.generate(context=context, question=question)

My Thoughts :

GPT-5.2 may not be a model that can do “magical new things,” but it is a model that can change “tasks that you were previously unsure about entrusting to AI” into “tasks that you can entrust with confidence. “

While future challenges remain, such as multimodal support, real-time optimisation, and safety assurance, these also represent significant development opportunities.

Beyond 2025, GEPA is expected to lead to innovative applications such as self-correcting AI systems, neural-symbolic integration, and meta-prompt engineering. GEPA will undoubtedly continue to play a central role in the future of prompt technology.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

DeepSeek-V3.2 + DocLing + Agentic RAG: Parse Any Document with Ease

Gao Dalie (Ilyass) — Mon, 15 Dec 2025 06:30:47 +0000

If you’ve been following open-source logical modelling, you know it's become a highly competitive field. Every few months, a new model comes out and says it breaks old limits, and some of them truly do

Just two days ago, after I quietly finished my exam and locked in, I was scrolling online late at night. DeepSeek, as always, sent a shockwave through the AI community.

DeepSeek launched its latest model, built for agents, “DeepSeek-V3.2,” and its high-performance version

These models have significantly improved their reasoning capabilities. combining technological innovations such as efficient sparse attention and large-scale reinforcement learning.

DeepSeek-V3.2 can go head-to-head with GPT-5, while Speciale, combining long-term thinking and theorem-proving capabilities, performs comparable to Gemini-3.0-Pro. One reader commented, “This model shouldn’t be called V3.2; it should be called V4.

In particular, the Speciale version achieved gold medal-level results at the 2025 IMO, IOI, and ICPC World Championships, placing in the top two at the ICPC World Championships and in the top ten at the IOI, achieving “Gold-medal performance.

As part of my research and development, I needed to extract text data from PDFs as accurately as possible. In the past, I have extracted text from PDFs using PyMuPDF or the OCR engine Tesseract.

These are powerful tools that have been used in many projects for many years. However, I encountered the following issue, possibly due to the PDF I was working with.

Docling, an open source library developed by IBM Research, is an effective solution to these challenges. Docling is a powerful tool that can structure and convert documents, such as PDFs and Word files, into Markdown.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Link

I’ll upload an Ocean AI PDF and ask the chatbot a question: “What is Ocean AI, and why is Ocean AI different from OpenAI?”

If you look at how the chatbot generates the output, you’ll see that the agent first runs a relevance check to determine whether the question is actually related to your uploaded documents. If it’s not relevant, the agent immediately rejects the question instead of generating a hallucinated answer.

For relevant questions, the agent parses the documents into structured formats such as Markdown or JSON. Then perform hybrid retrieval using both BM25 keyword search and vector embeddings to find the most relevant sections, even across multiple documents.

The Research Agent uses this retrieved content to generate an answer, and then the Verification Agent cross-checks the response against the original documents to confirm factual accuracy and catch unsupported claims or contradictions.

If verification fails, a self-correction loop automatically re-runs retrieval and research with adjusted parameters until the answer passes all checks. Once the answer is fully verified, the agent returns it. If at any point the question is found to be unrelated to the uploaded content, the agent clearly tells you instead of hallucinating.

What makes DeepSeek-V3.2 Unique?

Most powerful AI models face a common problem: as file length increases, model execution speed decreases significantly, and costs rise dramatically. This is because traditional models attempt to compare each word with all other words to understand the context.

DeepSeek-V3.2 addresses this problem by introducing a new method called DeepSeek Sparse Attention (DSA). You can think of it as a researcher conducting research in a library:

Traditional method (intensive attention): Researchers read every book on the shelf, page by page, just to answer one question. While comprehensive, this method is extremely slow and requires immense effort.
The new method (DeepSeek-V3.2): Researchers use a digital index (Lightning Indexer) to find key pages and read only those pages quickly. This method is just as accurate, but much faster.

What makes Docling Unique?

The biggest reason why Docling stands out from existing tools is that its design concept is based on collaboration with generative AI, particularly RAG (Retrieval Augmented Generation).

Modern AI applications require more than just extracting text. For AI to deeply understand the content of a document and generate accurate answers, it needs to know its meaning, including:

Is this sentence the “abstract” or the “conclusion” of the paper?
This string of numbers is not just text but a “table,” so what does each cell mean?
What “caption” accompanies this image?

While PyMuPDF and Tesseract extract text as “strings,” Docking uses the power of the Visual Language Model (VLM) to analyse these structures and relationships and output them as a “DoclingDocument” object with rich information.

This structured data is the key to dramatically improving RAG’s retrieval and answer generation quality.

Let’s Start Coding :

Let us now explore step by step and unravel the answer to how to use the DeepSeek-V3.2 + DocLing + Agentic RAG. We will install the libraries that support the model. For this, we will do a pip install requirements

pip install requirements

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.

DocumentConverter: A high-level Python class designed for converting documents into a structured DoclingDocument format.

EnsembleRetriever: Ensemble retriever that aggregates and orders the results of multiple retrievers by using weighted Reciprocal Rank Fusion.
**
DocLing:**

I created a VerificationAgent class that fact-checks AI-generated answers against source documents. In initI instantiate a deepseek-v3.2 model with zero temperature for deterministic outputs and build a prompt template that asks the LLM to verify answers in 4 specific ways: whether claims are directly supported, what's unsupported, what contradicts, and if it's relevant, forcing a structured response format for consistent parsing.

In check()I take the answer string and a list of Document objects, extract and concatenate all the document text into one context string, then create a LangChain pipeline (prompt → LLM → string parser) that I invoke with the answer and context to get back a verification report.

I log both the report and context for debugging, re-raise any errors that occur, and return a dict containing the verification report text and the context string. The whole point is to catch hallucinations by checking if the RAG system's generated answer is actually supported by the source documents.

import os
import hashlib
import pickle
from datetime import datetime, timedelta
from pathlib import Path
from typing import List
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter
from config import constants
from config.settings import settings
from utils.logging import logger

class DocumentProcessor:
    def __init__(self):
        self.headers = [("#", "Header 1"), ("##", "Header 2")]
        self.cache_dir = Path(settings.CACHE_DIR)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def validate_files(self, files: List) -> None:
        """Validate the total size of the uploaded files."""
        total_size = sum(os.path.getsize(f.name) for f in files)
        if total_size > constants.MAX_TOTAL_SIZE:
            raise ValueError(f"Total size exceeds {constants.MAX_TOTAL_SIZE//1024//1024}MB limit")

    def process(self, files: List) -> List:
        """Process files with caching for subsequent queries"""
        self.validate_files(files)
        all_chunks = []
        seen_hashes = set()

        for file in files:
            try:
                # Generate content-based hash for caching
                with open(file.name, "rb") as f:
                    file_hash = self._generate_hash(f.read())

                cache_path = self.cache_dir / f"{file_hash}.pkl"

                if self._is_cache_valid(cache_path):
                    logger.info(f"Loading from cache: {file.name}")
                    chunks = self._load_from_cache(cache_path)
                else:
                    logger.info(f"Processing and caching: {file.name}")
                    chunks = self._process_file(file)
                    self._save_to_cache(chunks, cache_path)

                # Deduplicate chunks across files
                for chunk in chunks:
                    chunk_hash = self._generate_hash(chunk.page_content.encode())
                    if chunk_hash not in seen_hashes:
                        all_chunks.append(chunk)
                        seen_hashes.add(chunk_hash)

            except Exception as e:
                logger.error(f"Failed to process {file.name}: {str(e)}")
                continue

        logger.info(f"Total unique chunks: {len(all_chunks)}")
        return all_chunks

    def _process_file(self, file) -> List:
        """Original processing logic with Docling"""
        if not file.name.endswith(('.pdf', '.docx', '.txt', '.md')):
            logger.warning(f"Skipping unsupported file type: {file.name}")
            return []

        converter = DocumentConverter()
        markdown = converter.convert(file.name).document.export_to_markdown()
        splitter = MarkdownHeaderTextSplitter(self.headers)
        return splitter.split_text(markdown)

    def _generate_hash(self, content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def _save_to_cache(self, chunks: List, cache_path: Path):
        with open(cache_path, "wb") as f:
            pickle.dump({
                "timestamp": datetime.now().timestamp(),
                "chunks": chunks
            }, f)

    def _load_from_cache(self, cache_path: Path) -> List:
        with open(cache_path, "rb") as f:
            data = pickle.load(f)
        return data["chunks"]

    def _is_cache_valid(self, cache_path: Path) -> bool:
        if not cache_path.exists():
            return False

        cache_age = datetime.now() - datetime.fromtimestamp(cache_path.stat().st_mtime)
        return cache_age < timedelta(days=settings.CACHE_EXPIRE_DAYS)

RelevanceChecker

I created a RelevanceChecker class that determines whether retrieved documents can answer a user's question by classifying them into three categories.

In init, I initialize a deepseek-v3.2 model with the API key and create a prompt template that instructs the LLM to classify passages as "CAN_ANSWER" (fully answers), "PARTIAL" (mentions topic but incomplete), or "NO_MATCH" (doesn't discuss topic at all), with emphasis that any topic mention should be "PARTIAL" not "NO_MATCH". I built a LangChain chain by piping prompt → LLM → string parser.

In the check() method, I take a question, a retriever object, and a k parameter (default 3) for how many top documents to analyse. I invoke the retriever with the question to get relevant chunks, returning "NO_MATCH" immediately if nothing comes back.

I print debug info showing document count and 200-character previews of the top k chunks for visibility. I combine the top k document texts into one string with double newlines, invoke the LLM chain with the question and combined content, and get back a classification string.

I validate the response is one of the three valid labels by converting to uppercase and checking against valid options, forcing "NO_MATCH" if the LLM returns something unexpected.
Finally, I return the validated classification, giving me a clear signal about whether my retriever found usable documents or if I need to fall back to alternative methods like web search.

# agents/relevance_checker.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_deepseek import ChatDeepSeek
from config.settings import settings

class RelevanceChecker:
    def __init__(self):
        # self.llm = ChatOpenAI(api_key=settings.OPENAI_API_KEY, model="gpt-4o")
        self.llm = ChatDeepSeek(api_key=settings.DEEPSEEK_API_KEY, model="deepseek-chat")

        self.prompt = ChatPromptTemplate.from_template(
            """
            You are given a user question and some passages from uploaded documents.

            Classify how well these passages address the user's question. 
            Choose exactly one of the following responses (respond ONLY with that label):

            1) "CAN_ANSWER": The passages contain enough explicit info to fully answer the question.
            2) "PARTIAL": The passages mention or discuss the question's topic (e.g., relevant years, facility names)
            but do not provide all the data or details needed for a complete answer.
            3) "NO_MATCH": The passages do not discuss or mention the question's topic at all.

            Important: If the passages mention or reference the topic or timeframe of the question in ANY way,
            even if incomplete, you should respond "PARTIAL", not "NO_MATCH".

            Question: {question}
            Passages: {document_content}

            Respond ONLY with "CAN_ANSWER", "PARTIAL", or "NO_MATCH".
            """
        )

        self.chain = self.prompt | self.llm | StrOutputParser()

    def check(self, question: str, retriever, k=3) -> str:
        """
        1. Retrieve the top-k document chunks from the global retriever.
        2. Combine them into a single text string.
        3. Pass that text + question to the LLM chain for classification.

        Returns: "CAN_ANSWER" or "PARTIAL" or "NO_MATCH".
        """

        print(f"[DEBUG] RelevanceChecker.check called with question='{question}' and k={k}")

        # Retrieve doc chunks from the retriever
        top_docs = retriever.invoke(question)[:k]  # Only use top k docs
        if not top_docs:
            print("[DEBUG] No documents returned from retriever.invoke(). Classifying as NO_MATCH.")
            return "NO_MATCH"

        print(f"[DEBUG] Retriever returned {len(top_docs)} docs.")

        # Show a quick snippet of each chunk for debugging
        for i, doc in enumerate(top_docs):
            snippet = doc.page_content[:200].replace("\n", "\\n")
            print(f"[DEBUG] Chunk #{i+1} preview (first 200 chars): {snippet}...")

        # Combine the top k chunk texts into one string
        document_content = "\n\n".join(doc.page_content for doc in top_docs)
        print(f"[DEBUG] Combined text length for top {k} chunks: {len(document_content)} chars.")

        # Call the LLM
        response = self.chain.invoke({
            "question": question, 
            "document_content": document_content
        }).strip()

        print(f"[DEBUG] LLM raw classification response: '{response}'")

        # Convert to uppercase, check if it's one of our valid labels
        classification = response.upper()
        valid_labels = {"CAN_ANSWER", "PARTIAL", "NO_MATCH"}
        if classification not in valid_labels:
            print("[DEBUG] LLM did not respond with a valid label. Forcing 'NO_MATCH'.")
            classification = "NO_MATCH"
        else:
            print(f"[DEBUG] Classification recognized as '{classification}'.")

        return classification

ResearchAgent

I created a ResearchAgent class that generates answers to questions using retrieved documents as context.

I create a prompt template that asks the LLM to answer questions based on provided context, being precise and factual, with an instruction to explicitly say "I cannot answer this question based on the provided documents" if the context is insufficient.

In the generate() method, I take a question string and a list of Document objects, then extract and concatenate all document text into one context string using double newlines as separators.

I invoke the chain with the question and context, which substitutes them into the template, sends the request to DeepSeek, and returns the generated answer as a string. I wrap this in try-except to log both the answer and full context for debugging, and re-raise any exceptions that occur.

Finally, I return a dictionary containing the draft answer and the context used, giving me both the generated response and traceability of what source material was used to create it.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class ResearchAgent:
    def __init__(self):
        """Initialize the research agent with the OpenAI model."""
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0.3,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0.3,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Answer the following question based on the provided context. Be precise and factual.

            Question: {question}

            Context:
            {context}

            If the context is insufficient, respond with: "I cannot answer this question based on the provided documents."
            """
        )

    def generate(self, question: str, documents: List[Document]) -> Dict:
        """Generate an initial answer using the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            answer = chain.invoke({
                "question": question,
                "context": context
            })
            logger.info(f"Generated answer: {answer}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error generating answer: {e}")
            raise

        return {
            "draft_answer": answer,
            "context_used": context
        }

Verification Agent

I created a VerificationAgent class that fact-checks AI-generated answers against source documents to catch hallucinations. In initI initialise a deepseek-v3.2 model with temperature 0 (fully deterministic), create a prompt template that instructs the LLM to verify four aspects (direct factual support, unsupported claims, contradictions, and relevance) with a structured response format, then build a LangChain chain.

In check()I take an answer string and a list of Document objects, concatenate all document text into one context string with double newlines, invoke the chain with the answer and context to get a verification report, log both the report and context for debugging in a try-except block, and return a dictionary with the verification report and context used for traceability.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class VerificationAgent:
    def __init__(self):
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Verify the following answer against the provided context. Check for:
            1. Direct factual support (YES/NO)
            2. Unsupported claims (list)
            3. Contradictions (list)
            4. Relevance to the question (YES/NO)

            Respond in this format:
            Supported: YES/NO
            Unsupported Claims: [items]
            Contradictions: [items]
            Relevant: YES/NO

            Answer: {answer}
            Context: {context}
            """
        )

    def check(self, answer: str, documents: List[Document]) -> Dict:
        """Verify the answer against the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            verification = chain.invoke({
                "answer": answer,
                "context": context
            })
            logger.info(f"Verification report: {verification}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error verifying answer: {e}")
            raise

        return {
            "verification_report": verification,
            "context_used": context
        }

Conclusion :

DeepSeek V3.2 doesn't win by scale, but by smarter thinking. With its sparse attention mechanism, lower cost, stronger long-context awareness, and superior tool-use inference capabilities, it demonstrates how open-source models can remain competitive without massive hardware budgets.

While it may not top every benchmark, it significantly improves how users interact with AI today. And that's precisely why it stands out in a highly competitive market.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI
Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

RAG Will Never Be the Same After Gemini File Search Tool

Gao Dalie (Ilyass) — Tue, 18 Nov 2025 22:44:50 +0000

Last week I heard bad news, and life hit me hard again. Moments like that remind me how fragile everything is — how one day we all leave, and even love can feel temporary.

In the middle of all this, I saw a post on X saying Gemini’s File Search Tool makes RAG super easy and is being offered at a really reasonable cost. I don’t know why, but something about it pushed me to try it

Google announced the File Search Tool, a fully managed search Augmentation generation (RAG) system built directly into the Gemini API.

Previously, to build a RAG, you had to choose a vector database, develop a chunking strategy, call an embedding model, and tie everything together. The file search tool handles all of that automatically behind the API.

These were major barriers for companies wanting to introduce AI, but with the introduction of the File Search Tool, these mechanisms can now be completed within the Gemini API.

Developers can simply upload files and use standard API calls to generate answers based on their own data. clearly indicating which part of which file the AI agent referenced when generating an answer. This helps prevent hallucination, a common problem with generative AI.

The File Search Tool helps developers build file search and ingestion pipelines in a simple, integrated, and flexible way to enhance Gemini answers with their own data. Storing files and generating embeddings at the time they are created for free, with a one-time fee only for the initial indexing of files.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check a video

During my development, one paper drew my attention. AI is increasingly involved across industries, influencing science and beyond. I will upload the Ocean AI PDF.

I will ask the chatbot a question: “What is Ocean AI, and why is Ocean AI different from OpenAI?” If you take a look at how the chatbot generates the output, you’ll see that the agent first saves my uploaded PDF to a temporary file, then creates a unique FileSearchStore with a random ID.

The agent uploads the PDF into this store and waits while Gemini breaks down the document into chunks and builds a searchable index — this is A wait_operation function that checks every 2 seconds until indexing finishes.

When I type my question and hit enter, query_file_search sends it to the Gemini api along with the store name. Gemini automatically searches through the indexed PDF chunks, finds the relevant sections about Ocean AI and how it differs from OpenAI, uses those chunks as context, and generates an answer using the selected model.

The response includes the answer text plus grounding metadata showing exactly which parts of the PDF were used, so when I click "View Sources", I can see the citations proving where the information came from. When I'm done, clicking "Clear PDF" deletes the entire store and cleans up all the data.

What makes File Search Tools?

The Gemini API File Search Tool consolidates these complex processes into a single, fully automated API callgenerateContent , allowing developers to leverage file search functionality within their existing APIs, eliminating the complex setup and management work previously required.

Unlike traditional keyword-based searches, the File Search Tool understands the meaning and context of your query and can find relevant information even if exact word matches are not used.

This is achieved through powerful vector search, leveraging the latest Gemini Embedding model.

Even more noteworthy is the implementation of auto-citation, which automatically includes citations to the specific documents used to generate the answer, greatly simplifying verification and fact-checking and making it much more useful for businesses.

Current limitations and expected improvements

The File Search Tool currently has some limitations. The most important limitation is the limited ability to adjust the number of chunks retrieved. During testing, we confirmed advanced configuration options such as metadata filters, but we hope that future enhancements will allow for more detailed control of the number of chunks.

There is also room for improvement in the accuracy of image recognition. Currently, it is possible to extract text from images, but it is not yet at the level of understanding the structure and relationships of diagrams. In particular, it can be difficult to extract meaningful information from documents written in Markdown format or with complex layouts.

File size limitations are also a consideration. Each file is limited to a maximum of 100MB, and the file search store size for the entire project is limited to 1GB-1TB, depending on the user tier. These limitations may affect practicality for large enterprises.

Differences from OpenAI/Anthropic

Currently, OpenAI’s Retrieval API and Anthropic’s File Contexts are well-known examples of RAG implementations. These systems use external storage to reference documents, but they require developers to build and manage a vector database, making them difficult to implement.

On the other hand, the File Search Tool completely automates this part and is done entirely within the Gemini API. The table below compares the three major RAG solutions.

As can be seen from this comparison, the File Search Tool is superior in terms of both development burden and operational costs, and is particularly suitable for prototype development and experimental use by individual developers.

In addition, the Gemini Embedding model provided by Google provides high search accuracy and is also a major attraction in that it can accurately extract information with similar meanings.

Let’s start coding :

Before we dive into our application, we will create an ideal environment for the code to work. For this, we need to install the necessary Python libraries.

pip install requirements

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed and perform some basic configuration.

import streamlit as st
import os
import time
import random
import string
import tempfile
from pathlib import Path
from PyPDF2 import PdfReader
from google import genai
from google.genai import types
from dotenv import load_dotenv

I designed these helper functions to handle the main tasks of the app. First, I created get_text(key, lang='en') to get translated text - it just looks up a word or phrase in a translation dictionary, defaults to English if the language doesn't exist, and returns the original key if nothing is found.

Then I built generate_random_id(length=8) to make random IDs for naming stores - it randomly picks 8 characters from letters and numbers and combines them into a string.

I developed wait_operation(client, op, sleep_sec=2, max_wait_sec=300) to wait for background operations to finish - it keeps checking every 2 seconds if the operation is done by calling the API, and if it takes longer than 5 minutes, it stops waiting and throws an error so the app doesn't hang forever.

Next, I made extract_text_from_pdf(pdf_file, lang='en') to pull text out of PDF files - it opens the PDF, goes through each page one by one, grabs the text from each page, adds it all together with line breaks, and returns the complete text.

I wrapped this in error handling, so if the PDF is broken or can't be read, it shows an error message to the user instead of crashing.

def get_text(key, lang='en'):
    """Get translated text for the given key and language"""
    return TRANSLATIONS.get(lang, TRANSLATIONS['en']).get(key, key)

# Helper Functions
def generate_random_id(length=8):
    """Generate a random ID for store naming"""
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))

def wait_operation(client, op, sleep_sec=2, max_wait_sec=300):
    """Wait for Operations API to complete with timeout"""
    start = time.time()
    while not op.done:
        if time.time() - start > max_wait_sec:
            raise TimeoutError("Operation timed out.")
        time.sleep(sleep_sec)
        op = client.operations.get(op)
    return op

def extract_text_from_pdf(pdf_file, lang='en'):
    """Extract text content from uploaded PDF file"""
    try:
        pdf_reader = PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() + "\n"
        return text
    except Exception as e:
        st.error(get_text('error_pdf_extract', lang).format(e))
        return None

Building on these utilities, I created three more functions to handle file management and storage. The save_uploaded_file(uploaded_file, lang='en') function takes care of saving uploaded files temporarily - it creates a temporary file that won't auto-delete, adds a .pdf extension to it, writes the uploaded file's content into it using getvalue(), and returns the file path so we can use it later with the other functions.

Next, create_file_search_store(client, store_name, lang='en') sets up a new storage space using the random ID from generate_random_id - It calls the API to create a file search store with a custom display name, returns the store object if successful, or shows an error get_text and returns None if it fails.

The last function, upload_file_to_store(client, file_path, store_name, display_name, lang='en'), actually uploads files into the store - it sends the file to the specified store using the API, adds some metadata like the source being "streamlit_upload" and a timestamp of when it was uploaded, then waits for the upload to complete using my wait_operation function from earlier, and returns the response once it's done.

def save_uploaded_file(uploaded_file, lang='en'):
    """Save uploaded file to temporary location"""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(get_text('error_save_file', lang).format(e))
        return None

def create_file_search_store(client, store_name, lang='en'):
    """Create a new File Search Store"""
    try:
        store = client.file_search_stores.create(
            config={'display_name': store_name}
        )
        return store
    except Exception as e:
        st.error(get_text('error_create_store', lang).format(e))
        return None

def upload_file_to_store(client, file_path, store_name, display_name, lang='en'):
    """Upload file to File Search Store"""
    try:
        upload_op = client.file_search_stores.upload_to_file_search_store(
            file=file_path,
            file_search_store_name=store_name,
            config={
                'display_name': display_name,
                'custom_metadata': [
                    {"key": "source", "string_value": "streamlit_upload"},
                    {"key": "timestamp", "numeric_value": int(time.time())}
                ]
            }
        )
        upload_op = wait_operation(client, upload_op)
        return upload_op.response
    except Exception as e:
        st.error(get_text('error_upload_store', lang).format(e))
        return None

I built two final functions that let users interact with the uploaded files and clean up afterwards. The query_file_search(client, question, store_name, model, lang='en') function is where the magic happens - it takes a user's question and searches through the files in the store by calling the AI model with special file search tools configured.

It passes the question to the model along with a reference to the store name we created earlier, and the model automatically searches through all the uploaded files to find relevant information and generate an answer.

Like the other functions, it uses get_text for error messages and returns None if something goes wrong. After the user is done working with their files, cleanup_store(client, store_name, lang='en') handles the cleanup - it deletes the entire file search store, including all uploaded files, by calling the delete API with the force: True flag to make sure everything gets removed, returns True if successful or False If it fails, and shows an error message using the translation helper if anything breaks during deletion.

def query_file_search(client, question, store_name, model, lang='en'):
    """Query the File Search Store with a question"""
    try:
        response = client.models.generate_content(
            model=model,
            contents=question,
            config=types.GenerateContentConfig(
                tools=[
                    types.Tool(
                        file_search=types.FileSearch(
                            file_search_store_names=[store_name]
                        )
                    )
                ]
            )
        )
        return response
    except Exception as e:
        st.error(get_text('error_query', lang).format(e))
        return None

def cleanup_store(client, store_name, lang='en'):
    """Delete the File Search Store"""
    try:
        client.file_search_stores.delete(
            name=store_name,
            config={'force': True}
        )
        return True
    except Exception as e:
        st.error(get_text('error_cleanup', lang).format(e))
        return False

Conclusion :

The file search tool puts the advanced technology of RAG within the reach of all developers, not just a select few experts. This is truly the “democratisation of RAG.”

from worries about complex infrastructure and costs, developers can focus on developing more creative applications that directly address user challenges.

Combining your unique data with Gemini’s powerful intelligence will create new business value that was previously impossible. Let’s use this new tool to create the applications of the future!

Reference:

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI
Book an Appointment with me: https://topmate.io/gaodalie_ai
Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d
Subscribe to the Newsletter for free: https://substack.com/@gaodalie

DeepSeek-OCR + LLama4 + RAG Just Revolutionized Agent OCR Forever

Gao Dalie (Ilyass) — Wed, 29 Oct 2025 07:48:52 +0000

During the weekend, I scrolled through Twitter to see what was happening in the AI community. Once again, DeepSeek has drawn worldwide attention.

This isn’t just any text recognition tool — it’s a brand-new contextual optical compression technology that uses visual methods to solve the challenge of processing long texts, offering a completely new approach to handling massive amounts of document information.

Anyone who has used a large language model (LLM) has encountered a common pain point:

When you ask the model to summarise tens of thousands of words from conference notes or academic papers, it starts to lose its memory.

This is because the quadratic complexity of sequence length inherently limits GPT, Gemini, and Claude — the longer the input, the more computational power it requires.

But humans aren’t like that.
We can glance at a note or a diagram and instantly recall an entire passage.

Traditionally, for AI to understand long documents, the entire document must be converted into digital text. This process consumes a large number of tokens (which can be understood as the units used by AI to process information), resulting in low computational efficiency.

DeepSeek-OCR takes a different approach: it first converts text into images and then uses visual tokens to compress and represent this information. Imagine you have a 10,000-word article — instead of having AI read it word by word, it can simply “glance” at an image to understand and reconstruct the original text.

The core breakthrough lies in its ability to represent rich information in a single image containing document text using far fewer tokens than the equivalent text. This means that optical compression with visual tokens can achieve higher compression ratios, allowing us to do more with fewer resources.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check a video

I will ask the chatbot a question: “What are the main findings?” If you take a look at how the chatbot generates the output, you’ll see that the agent extract text from each page, but if a page contains less than 50 characters or lacks embedded text, it converts that page into a high-resolution image and sends it to DeepSeek-OCR on Replicate, which uses an innovative “Contextual Optical Compression” approach where it converts the document into visual tokens and compresses the information — essentially allowing the AI to “glance” at an image representation rather than reading word-by-word, which can turn a 10,000-word article into a much more efficient compressed format.

Once all text is extracted, the system breaks it into 500-character chunks with 50-character overlap to maintain context, converts each chunk into mathematical vectors using OpenAI embeddings, and stores them in a Chroma vector database that persists on disk for future use.

When you ask a question, the agent searches through these vectors to find the 5 most semantically similar document chunks, assembles them into a context prompt along with your question and instructions to cite page numbers, then sends everything to the Llama 3.1 405B model running on Replicate’s streaming API, which processes the prompt and generates an intelligent answer chunk-by-chunk in real-time.

Then generate the answer and the source document citations, showing which pages the information came from, creating a complete RAG agent that can understand any PDF

What makes DeepSeek-OCR Unique?

DeepSeek-OCR is an end-to-end OCR and document parsing model designed to achieve optical context compression.

This model consists of two major components: a DeepEncoder that compresses high-resolution image input into a small number of visual tokens, and a DeepSeek-3B-MoE decoder (a Mixture-of-Experts language model) that restores the original text from the visual token sequence.

DeepEncoder (approximately 380 million parameters) incorporates a SAM-based window attention mechanism for local image feature extraction, and by inserting a two-layer CNN with 16x compression in between, it significantly compresses a 1024x1024 pixel image from 4096 patches to around 256 tokens.

The decoder side, which receives these visual tokens, has a total of 3 billion parameters (approximately 570 million are effective during inference) and features a MoE structure that dynamically utilises 6 experts per step from a pool of 64 experts, allowing for lightweight yet efficient text reconstruction.

With this architecture, DeepSeek-OCR takes an unconventional approach by converting the contents of a text document into an “image” and then reading it.

PaddleOCR-VL Vs DeepSeek-OCR :

Check the video PaddleOCR-VL: Video

When I tested both OCR models, I found something interesting — PaddleOCR-VL, which has fewer parameters (0.9B), was beating much larger 3B models in real-world tests.

I gave it tough jobs: reading vertical text in the right direction, understanding complex math formulas, and handling documents with multiple columns — and PaddleOCR-VL nailed them all, while DeepSeek-OCR made mistakes with reading order and formulas, even though it has cool compression features.

Then I discovered something fun in DeepSeek-OCR’s research paper — they actually thanked PaddleOCR and admitted they used it to label their training data, which made me realize why companies like Baidu, DeepSeek, and Shanghai AI Lab are all releasing OCR models: they’re not making OCR tools as their main job, they’re building them to clean up huge amounts of data for training their AI models, and we’re getting these powerful OCR tools as free bonuses.

After testing everything, I figured out that if you’re building something for real work and need to read printed text, forms, tables, or documents in different languages, PaddleOCR-VL is the way to go, while DeepSeek-OCR is better if you’re a researcher trying to compress data to save money on AI costs.

Text Tokens vs. Visual Tokens: The Fundamental Difference

In traditional LLMs, text is broken down into discrete text tokens (typically words or subwords). Each token is assigned a fixed ID in the vocabulary and mapped into a vector via a large “lookup table” (embedded layer). While this process is efficient, its expressive power is limited by the limited vocabulary.

Visual Tokens are completely different. Instead of coming from a fixed lookup table, they are continuous vectors generated directly from image pixels by a neural network (visual encoder). This means:

Higher information density: Visual tokens exist in a continuous vector space and can encode richer and more nuanced information than discrete text tokens. A visual token can represent the color, shape, texture, and spatial relationships within an area, rather than just a word or subword.

Global pattern perception: The visual encoder can capture global information, such as the overall layout, typesetting, and font style of the text, which is lost in the plain text token sequence.

Larger expression space: In theory, the “vocabulary” of visual tokens is infinite because they are continuous vectors generated directly from pixels rather than selected from a fixed dictionary.

Let’s start coding :

Before we dive into our application, we will create an ideal environment for the code to work. For this, we need to install the necessary Python libraries.


pip install requirements

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed and perform some basic configuration.

import os
import replicate
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.language_models.llms import LLM
from typing import List, Optional, Any
import fitz
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

I developed this custom Llama class by inheriting from LangChain's base LLM class and configuring it with the Llama 3.1 405B model identifier, token limits, and temperature settings.

I implemented the required _llm_type property to return an identifier, then I built the core _call method, which takes a prompt, packages it with the configuration into a dictionary, sends it to Replicate's streaming API, and loops through the response chunks to concatenate them into a complete answer.

class Llama(LLM):
    model: str = "meta/meta-llama-3.1-405b-instruct"
    max_tokens: int = 1024
    temperature: float = 0.7

    @property
    def _llm_type(self) -> str:
        return "replicate_llama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        input_data = {
            "prompt": prompt,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature
        }

        output = ""
        for event in replicate.stream(self.model, input=input_data):
            output += str(event)

        return output

I built this OCRPDFLoader class to extract text from PDFs by first trying text extraction and falling back to OCR when needed. I initialised it with a file path, an optional OCR flag, and a text threshold (default 50 characters) to detect if a page has enough text.

In the load method, I opened the PDF with PyMuPDF, looped through each page to extract text, then checked if OCR was forced or if the extracted text was below the threshold - if so,

I called my _ocr_page method, which I built to convert the page into a high-resolution PNG image, send it to Replicate's DeepSeek-OCR API, get the OCR text back, clean up the temporary image, and return the extracted text.

Finally, I packaged each page's text into LangChain Document objects with metadata (source file, page number, filename) and returned them as a list, giving me a smart loader that automatically handles both digital and scanned PDFs.

class OCRPDFLoader:
    def __init__(self, file_path: str, use_ocr: bool = False, text_threshold: int = 50):
        self.file_path = file_path
        self.use_ocr = use_ocr
        self.text_threshold = text_threshold

    def load(self) -> List[Document]:
        doc = fitz.open(self.file_path)
        documents = []

        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()

            if self.use_ocr or len(text.strip()) < self.text_threshold:
                print(f"OCR: page {page_num + 1}")
                text = self._ocr_page(page, page_num)

            if text.strip():
                documents.append(Document(
                    page_content=text.strip(),
                    metadata={
                        'source': self.file_path,
                        'page': page_num + 1,
                        'filename': Path(self.file_path).name
                    }
                ))

        doc.close()
        return documents

    def _ocr_page(self, page, page_num, temp_dir='./temp_ocr'):
        os.makedirs(temp_dir, exist_ok=True)

        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img_path = f"{temp_dir}/page_{page_num}.png"
        pix.save(img_path)

        with open(img_path, "rb") as image_file:
            input_data = {
                "image": image_file,
                "task_type": "Free OCR"
            }

            output = replicate.run(
                "lucataco/deepseek-ocr:cb3b474fbfc56b1664c8c7841550bccecbe7b74c30e45ce938ffca1180b4dff5",
                input=input_data
            )

        os.remove(img_path)
        return output

Next, I built this LangChainPDFRAG. The lass is the main orchestrator that ties everything together into a complete RAG system. I initialised it by setting up my custom Llama model for generating answers, OpenAI embeddings for converting text into vectors, a text splitter that breaks documents into 500-character chunks with 50-character overlap to maintain context between chunks, and a Chroma vector database that I configured to persist on disk so it could reload existing data between sessions.

I created the add_pdf method, which uses my OCR loader to extract text from PDFs, splits that text into manageable chunks, then either creates a new vector store or adds to an existing one by converting each chunk into embeddings and storing them for semantic search.

Finally, I implemented the query method where I set up a retriever to find the 5 most relevant document chunks, built a LangChain chain that takes a user's question, retrieves relevant context, formats it into a prompt template asking the LLM to cite page numbers, passes everything to my Llama model for generation, and returns both the generated answer and the source documents with their page numbers - essentially creating a complete question-answering system that can intelligently search through PDFs and provide accurate, cited responses.

class LangChainPDFRAG:
    def __init__(self, 
                 llm_model='meta/meta-llama-3.1-405b-instruct',
                 embedding_model='text-embedding-3-small',
                 persist_directory='./chroma_db'):

        self.llm = Llama(model=llm_model)
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.persist_directory = persist_directory
        self.vectorstore = None

        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

        if os.path.exists(persist_directory):
            self.vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=self.embeddings
            )

    def add_pdf(self, pdf_path: str, use_ocr: bool = False):
        loader = OCRPDFLoader(pdf_path, use_ocr=use_ocr)
        documents = loader.load()
        splits = self.text_splitter.split_documents(documents)

        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=self.embeddings,
                persist_directory=self.persist_directory
            )
        else:
            self.vectorstore.add_documents(splits)

        print(f"Added {len(splits)} chunks from {Path(pdf_path).name}")
        return len(splits)

    def query(self, question: str):
        if self.vectorstore is None:
            raise ValueError("No documents.")

        retriever = self.vectorstore.as_retriever(search_kwargs={"k": 5})

        def format_docs(docs):
            return "\n\n".join([doc.page_content for doc in docs])

        prompt = ChatPromptTemplate.from_template(
            "You are a helpful assistant. Answer based on the context provided. Cite page numbers when relevant.\n\n"
            "Context:\n{context}\n\n"
            "Question: {question}\n\n"
            "Answer:"
        )

        chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

        docs = retriever.invoke(question)
        answer = chain.invoke(question)

        return {
            'answer': answer,
            'sources': [
                {
                    'filename': doc.metadata.get('filename'),
                    'page': doc.metadata.get('page'),
                    'content': doc.page_content[:200]
                }
                for doc in docs
            ]
        }

I instantiated the RAG system with Llama 3.1 405B, loaded a PDF into the vector database, and queried it with a question. The Agent retrieved relevant document chunks, generated an answer, and returned both the answer and source citations

if __name__ == "__main__":
    # Using Llama 3.1 405B from Replicate
    rag = LangChainPDFRAG(llm_model='meta/meta-llama-3.1-405b-instruct')

    rag.add_pdf('TSLA-Q2-2025-Update.pdf', use_ocr=False)

    result = rag.query('What are the main findings?')

    print("=== Answer ===")
    print(result['answer'])

    print("\n=== Sources ===")
    for source in result['sources']:
        print(f"- {source['filename']}, Page {source['page']}")

Conclusion :

DeepSeek-OCR is not just a more powerful OCR tool, but a research paper that opens a new chapter. The concept of visual-text compression that it proposes offers an imaginative path to solving one of the biggest challenges facing current large-scale models: the bottleneck of long context processing efficiency.

By “rendering” textual information as two-dimensional images and compressing it into information-dense visual tokens using an efficient visual encoder, DeepSeek-OCR demonstrates that AI can “see images” like humans can, allowing it to understand and remember large amounts of information more efficiently.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai

Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d

Subscribe to the Newsletter for free: https://substack.com/@gaodalie

PaddleOCR VL + RAG: Revolutionize Complex Data Extraction (Open-Source)

Gao Dalie (Ilyass) — Fri, 24 Oct 2025 16:55:07 +0000

Not even a month ago, I made a video about MistralOCR that many of you liked.

After that, a follower reached out with a problem they were having with an OCR Chatbot. I figured this was a common issue, so I decided to make a new video to help them and other developers.

When documents contain complex tables, mathematical formulas, or multi-column layouts, traditional OCR tools often generate messy content that requires manual sorting.

Then, just last week, I was browsing GitHub and came across Baidu's newly open-sourced PaddleOCR-VL-0.9B.

I'll be honest - when I saw it had only 0.9 billion parameters, my first thought was " Oh, another small model joining the fun?" But out of professional curiosity, I had to ask: could this one actually deliver? What I found completely stunned me.

This isn't OCR, it's a quantum leap in document understanding
PaddleOCR-VL completely exceeded my expectations. It achieved the world's first place in comprehensive performance, scoring 92.6 on the global authoritative evaluation list, OmniDocBench v1.5. Its inference speed increased by 14.2% compared with MinerU2.5 and 253.01% compared with dots.ocr.

The most intuitive feeling I had was that it was very accurate, or too accurate! It is worthy of being the model that can reach the top and be ranked first.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Check a video

Today, I'll be putting PaddleOCR-VL to the test on four key challenges: Formula Recognition, Table Recognition, Reading Order, and Handwritten Text.

Let's start with Formula Recognition. I've uploaded an image containing complex mathematical formulas. As you can see, the model handles them exceptionally well - accurately interpreting superscripts, subscripts, and even very long, intricate expressions.
Next up is Table Recognition.

This is a notoriously difficult problem, and there are many types of tables, sometimes with borders and sometimes without, containing numerous numbers that are very easy for models to misinterpret. I used PaddleOCR-VL on several table examples and found its accuracy to be genuinely impressive.

Another major challenge is understanding document Structure and Reading Order. In modern documents, content is not only more complex but also comes in highly varied layouts. Think multi-column designs, mixed text and images, folds, color printing, tilted scans, and handwritten annotations - all of which complicate OCR. The correct reading order isn't always a simple top-to-bottom, left-to-right flow.

The PaddleOCR-VL technical report demonstrates how the model can understand these complex structures, almost like a human. Whether it's an academic paper, a multi-column newspaper, or a technical report, it intelligently analyzes the layout and restores a reading order that matches human intuition.

Finally, PaddleOCR-VL remains extremely stable even with more complex layouts. Take this handwritten note, for example. It combines text, numbers, paragraphs, and images in a layout with left-right and top-bottom columns that typically only a human could decipher.

What Makes PaddleOCR VL Unique?

PaddleOCR VL is no longer just simple text recognition, but can really "understand" the document structure. Whether it is an academic paper, a multi-column newspaper or a technical report, PaddleOCR-VL can intelligently understand the document layout and automatically organise the content in the correct order.

At the same time, it accurately extracts complex content information, such as tables, mathematical formulas, handwritten notes, and chart data in documents. It converts them into structured data that can be directly used.

In addition, it supports recognition of 109 languages, covering multilingual scenarios such as Chinese, English, French, Japanese, Russian, Arabic, and Spanish, greatly improving the model's recognition and processing capabilities in multilingual documents.

How PaddleOCR VL It Trained

PaddleOCR-VL consists of two parts: PP-DocLayoutV2 and PaddleOCR-VL-0.9B.

Among them, the core part is PaddleOCR-VL-0.9B, which integrates a pre-trained visual encoder with a dynamic resolution preprocessor, a two-layer MLP projector, and a pre-trained large language model.
The preprocessing technology uses native dynamic high resolution. The visual encoder uses the NaViT style encoder, which supports native resolution input.

This design reduces hallucinations and improves the performance of the visual language model PaddleOCR-VL-0.9B.
The projector efficiently connects the features of the visual encoder to the embedding space of the language model.

In an autoregressive language model, the entire sequence is generated by predicting one token at a time. This means that the size of the decoder directly affects the overall inference latency, so smaller models decode faster.

Let's start coding

Let us now explore step by step and unravel the answer to creating a powerful reasoning app. We will install the libraries that support the model. For this, we will do a pip install

!pip uninstall -y torch paddlepaddle paddlepaddle-gpu
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
!pip install paddleocr paddlepaddle
!pip install langchain langchain-community langchain-openai faiss-cpu sentence-transformers openai python-dotenv

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed and perform some basic configuration.

PaddleOCR: converts documents and images into structured, AI-friendly data (like JSON and Markdown) with industry-leading accuracy - powering AI applications.

import torch
from paddleocr import PaddleOCR
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document

So I built this SimpleRAG system that combines PaddleOCR-VL for text extraction with OpenAI for generating queries. Let me walk you through what I developed here.

In the initialisation, I set up the core components - I'm using HuggingFace's BGE embeddings for vector representations and GPT-4o as the chat model with zero temperature for consistent responses. I initialize placeholders for the vectorstore and QA chain that we'll build later.

Now, for the extraction method, first I tried using the HuggingFace transformers version of PaddleOCR, which threw a weird error about image tokens not matching, then installing PaddlePaddle actually broke PyTorch (had to restart the runtime and reinstall everything in the right order), then I kept guessing at the API because the methods were deprecated and the new ones had different parameters.

The real breakthrough came when I just printed out what the result object actually looked like - turns out it's just a list with one dictionary inside, and that dictionary has a key called rec_texts which is literally just a list of all the text strings that were found in the image.

So instead of trying to access some complex nested object structure with .boxes.text I just needed to check if the result was a dictionary, grab the rec_texts key, and extend my list with those strings.

class SimpleRAG:
    def __init__(self):
        self.embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.vectorstore = None
        self.qa_chain = None
        self.ocr = PaddleOCR(use_textline_orientation=True, lang='en')

    def extract_text_from_images(self, image_paths: list):
        docs = []
        for path in image_paths:
            result = self.ocr.predict(input=path)

            text_lines = []
            for res in result:
                if isinstance(res, dict) and 'rec_texts' in res:
                    text_lines.extend(res['rec_texts'])

            text = "\n".join(text_lines) if text_lines else "No text found"
            docs.append(Document(page_content=text, metadata={'source': path}))

        return docs

In build_index, extract text from all images, split the documents into 1000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, create a FAISS vectorstore with BGE embeddings, and set up a RetrievalQA chain that uses GPT-4o and retrieves the top 3 relevant chunks per query.

For a query, I just pass the question to the QA chain, which handles retrieval and generation, returning the answer.

def build_index(self, image_paths: list):
        docs = self.extract_text_from_images(image_paths)

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        splits = text_splitter.split_documents(docs)

        self.vectorstore = FAISS.from_documents(splits, self.embeddings)
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3})
        )
def query(self, question: str):
        return self.qa_chain.invoke(question)

# Usage
rag = SimpleRAG()
rag.build_index(["Your pic"])
answer = rag.query("extract all the table?")
print(answer)

Conclusion :

In this era of rapidly advancing AI technology, we're often bombarded with hype about "the most powerful ever" and "disruptive." However, truly valuable breakthroughs often come from innovations that solve specific problems and make technology easier to use.

PaddleOCR-VL may not make mainstream headlines, but for developers who need to process documents every day, it may be the long-awaited solution.

After all, the best technologies are those that are quietly integrated into daily work, making you hardly aware of their existence. PaddleOCR-VL is taking a solid step in this direction.

🧙‍♂️ I am an AI Generative expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 Consulting Call With Me.

I would highly appreciate it if you

❣ Join my Patreon: https://www.patreon.com/GaoDalie_AI

Book an Appointment with me: https://topmate.io/gaodalie_ai

Support the Content (every Dollar goes back into the video):https://buymeacoffee.com/gaodalie98d

Subscribe to the Newsletter for free: https://substack.com/@gaodalie