<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gao Dalie (Ilyass)</title>
    <description>The latest articles on DEV Community by Gao Dalie (Ilyass) (@gaodalie_ai).</description>
    <link>https://dev.to/gaodalie_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1062623%2F4b42a880-5d73-4838-9ba3-2e721f64839f.png</url>
      <title>DEV Community: Gao Dalie (Ilyass)</title>
      <link>https://dev.to/gaodalie_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gaodalie_ai"/>
    <language>en</language>
    <item>
      <title>How to build Claude Skills 2.0 Better than 99% of People</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Sun, 15 Mar 2026 19:37:07 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/how-to-build-claude-skills-20-better-than-99-of-people-1pj7</link>
      <guid>https://dev.to/gaodalie_ai/how-to-build-claude-skills-20-better-than-99-of-people-1pj7</guid>
      <description>&lt;p&gt;“It’s a pain to give the same instructions to the AI ​​every time…” “The AI ​​never remembers the company’s rules and formats…” “Everyone on the team uses the AI ​​in different ways, so in the end, only those who are good at it benefit…”&lt;/p&gt;

&lt;p&gt;Anthropic has just upgraded an incredible feature that can solve these problems. It’s called “Claude Skills.” This isn’t just an update to the AI agent. It’s a next-generation feature that lets you teach the AI your business processes and specialised knowledge so that it can grow into the ultimate expert tailored to your company’s needs.&lt;/p&gt;

&lt;p&gt;Claude Skills is the most powerful feature for teaching Claude specific tasks and workflows. What I’ve found most exciting about using it is that it eliminates the need to explain your preferences and processes in every conversation.&lt;/p&gt;

&lt;p&gt;A Skill is a set of instructions packaged in a simple folder that you can set up once and benefit from every time. It really shines when you have a consistent workflow, such as generating front-end designs from specs or creating documents in line with your team’s style guide.&lt;/p&gt;

&lt;p&gt;In my experience, skills are not just “macros” or “templates”; they act as a “knowledge base” that enhances Claude’s decision-making abilities. They work particularly well with built-in functions such as code execution and document creation, allowing Claude to process even complex tasks seamlessly.&lt;/p&gt;

&lt;h1&gt;
  
  
  What are Skills?
&lt;/h1&gt;

&lt;p&gt;Skills are reusable pieces of knowledge and procedures that Claude Code can refer to in order to perform specific tasks. Each Skill is primarily defined as a Markdown file (SKILL.md) and can include associated scripts and resources as needed.&lt;/p&gt;

&lt;p&gt;Claude Code loads the appropriate Skill in response to a user request and executes the task according to the instructions. This allows you to automate complex workflows and routine tasks consistently.&lt;/p&gt;

&lt;p&gt;Claude Code treats each Skill as reference knowledge: it can directly execute the Skill’s scripts and manage workflows, so you can define, in a rule-based way, what should be done and when.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Skill-creator?
&lt;/h1&gt;

&lt;p&gt;Skill-creator is a “meta skill” that allows you to create, test, and improve skills in one go.&lt;/p&gt;

&lt;p&gt;Roughly speaking, what skill-creator does is the following five things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask “What kind of skill do you want to develop?”&lt;/li&gt;
&lt;li&gt;Automatically generate a draft of SKILL.md&lt;/li&gt;
&lt;li&gt;Test it by actually running it with a test prompt&lt;/li&gt;
&lt;li&gt;Evaluate the results and propose improvements&lt;/li&gt;
&lt;li&gt;Repeat steps 2 to 4 until you are satisfied&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is that Claude Code itself goes through the cycle of writing SKILL.md, trying it, and fixing it, which you would otherwise do by hand.&lt;/p&gt;

&lt;h1&gt;
  
  
  How to write SKILL.md?
&lt;/h1&gt;

&lt;p&gt;File structure&lt;/p&gt;

&lt;p&gt;Basically, you can create a skill by simply creating a .claude/skills folder and placing a SKILL.md file for the skill under it.&lt;/p&gt;

&lt;p&gt;The contents of SKILL.md are written in YAML front matter and Markdown, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Skill Name&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Brief&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;what&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Skill&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;does&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;when&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Your Skill Name&lt;/span&gt;
&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
Provide clear, step-by-step guidance for Claude.
&lt;span class="gu"&gt;## Examples&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show concrete examples of using this Skill.&lt;br&gt;
First, I will explain the YAML part.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Your Skill Name&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Brief description of what this Skill does and when to use it&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is called metadata and is a very important part of creating a skill.&lt;/p&gt;

&lt;p&gt;Claude reads the metadata at startup, so it knows only which skills exist and when each is available, and incorporates that into the system prompt. This approach lets you have many skills without unnecessarily bloating your context.&lt;/p&gt;

&lt;p&gt;When a prompt or request matches the skill’s metadata, Claude reads SKILL.md from the file system.&lt;/p&gt;

&lt;p&gt;Whether a skill is actually triggered depends heavily on the content of the metadata, so this is a very important factor.&lt;/p&gt;
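&lt;p&gt;As a hypothetical illustration (the wording here is mine, not from the official docs), a description that names concrete tasks and trigger phrases fires far more reliably than a vague one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Too vague: Claude rarely knows when to use it
description: Helps with reports

# Specific: names the task and the trigger phrases
description: Generates the weekly team report in the company template.
  Use when the user asks for a "weekly report" or "status update".
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;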

&lt;p&gt;Next, I will explain the content section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;
&lt;span class="gh"&gt;# Your Skill Name&lt;/span&gt;
&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
Provide clear, step-by-step guidance for Claude.
&lt;span class="gu"&gt;## Examples&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show concrete examples of using this Skill.&lt;br&gt;
The metadata is always loaded when Claude starts up, but the content part is loaded at runtime. Then, when an agent skill is executed, Claude will process the contents in the content part.&lt;/p&gt;

&lt;p&gt;Official best practices recommend keeping your SKILL.md under 500 lines. If it exceeds that, split out detailed reference material into a separate file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.claude/skills/
  my-skill/
    SKILL.md          ← Main instructions (within 500 lines)
    templates/        ← Template files
    reference.md      ← Detailed reference material
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Read instructions in SKILL.md to guide Claude to load additional files only when needed, rather than loading everything at once. This is Progressive Disclosure: provide the core instructions first and unpack the details as needed.&lt;/p&gt;

&lt;p&gt;The key here is that even though Agent Skills load efficiently, it’s best to keep the content part brief as well: when Claude loads the content part, it competes with the conversation history and other context.&lt;/p&gt;

&lt;p&gt;Therefore, omit from the content section anything already covered by CLAUDE.md or the system prompt, as well as general references to programming languages, libraries, and so on. One of the tricks to creating a highly accurate skill is deciding which parts to omit and where the content section should begin.&lt;/p&gt;
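&lt;p&gt;For instance, a SKILL.md body that defers detail to separate files (the file names here are illustrative) might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
1. Draft the report following the steps below.
2. Before applying formatting, Read reference.md for the full style rules.
3. Copy the skeleton from templates/report.md and fill it in.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;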

&lt;h1&gt;
  
  
  Why is this skill needed?
&lt;/h1&gt;

&lt;p&gt;When responding to PR review comments, we faced the following challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s time-consuming to check every review comment&lt;/li&gt;
&lt;li&gt;It’s hard to know which comments are unaddressed&lt;/li&gt;
&lt;li&gt;It’s a hassle to communicate the contents of review comments to Claude Code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using this Skill, Claude Code will automatically retrieve unaddressed comments and suggest fixes.&lt;/p&gt;
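&lt;p&gt;The article doesn’t show the skill itself, but a minimal SKILL.md sketch for such a workflow (the name, description, and command are my assumptions) might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;---
name: pr-review-comments
description: Fetches unresolved PR review comments and proposes fixes.
  Use when the user asks to address review comments on a pull request.
---
&lt;span class="gu"&gt;## Instructions&lt;/span&gt;
1. List review comments with the GitHub CLI, e.g.
   gh api repos/{owner}/{repo}/pulls/{pr_number}/comments
2. For each unaddressed comment, locate the referenced file and line.
3. Propose a fix and ask for confirmation before editing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;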

&lt;h1&gt;
  
  
  How Skills Work
&lt;/h1&gt;

&lt;p&gt;A Skill is simply a folder containing commands.&lt;br&gt;
At the heart of a Skill is a folder containing a SKILL.md file. This Markdown file uses YAML Front Matter to define metadata (such as name and description), while the main body contains clear, step-by-step task instructions and examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
my-Skill.zip
  my-Skill/
  Skill.md
  resources/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude will automatically discover and load the relevant skills.&lt;br&gt;
No manual skill triggering is required. At the start of a session, Claude scans the metadata (name and description) of all installed skills and loads this brief information into its system prompt.&lt;/p&gt;

&lt;p&gt;When your request matches the description of a skill, Claude automatically reads and loads the complete instructions for that skill.&lt;/p&gt;

&lt;p&gt;The “progressive disclosure” mechanism makes Skills extremely efficient.&lt;br&gt;
A Skill uses a three-layer structure (YAML front matter, body, and file references) to feed information into the model context gradually and on demand, avoiding a one-time overload and improving efficiency and token economy.&lt;/p&gt;

&lt;p&gt;Skills are designed with token efficiency in mind. Upon initial loading, each Skill uses only a few dozen tokens for its metadata. The detailed instructions for a Skill are loaded into the context window only when it is triggered.&lt;/p&gt;

&lt;p&gt;This on-demand loading mechanism means you can install a large number of Skills without filling the context window and degrading model performance.&lt;/p&gt;

&lt;p&gt;For more complex Skills, different instructions can be split across multiple files, and Claude reads only the parts needed for the current task, further conserving tokens.&lt;/p&gt;
&lt;h1&gt;
  
  
  MCP Vs Skills
&lt;/h1&gt;

&lt;p&gt;Skills are another powerful layer for those already using MCP (Model Context Protocol). I find the relationship best understood with the analogy of a kitchen and a recipe.&lt;/p&gt;

&lt;p&gt;MCP provides a professional kitchen, giving you access to the tools, ingredients, and equipment. Skills, on the other hand, are recipes that provide step-by-step instructions for creating something of value.&lt;/p&gt;

&lt;p&gt;Combining these two allows users to accomplish complex tasks without having to figure out all the steps themselves. When I first built the MCP server, I thought that just providing access to the tools would be enough, but in reality, there was a lack of workflow guidance on how to use the tools, which confused users.&lt;/p&gt;

&lt;p&gt;After introducing Skills, a clear division of roles was created: MCP defined what can be done, and Skills taught how to do it, and the user experience improved dramatically.&lt;/p&gt;

&lt;h1&gt;
  
  
  Installing this Skill
&lt;/h1&gt;

&lt;p&gt;If you want to install Claude Code in its best form, with all the best features, you have come to the right place. Make sure you have VS Code; if you don’t, go and install it first. I won’t cover that installation here.&lt;/p&gt;

&lt;p&gt;Open your new project in VS Code, then open the Extensions panel, search for “Claude Code”, make sure you see the verification badge, and install the extension.&lt;/p&gt;

&lt;p&gt;After you install Claude Code, look at the very top of the window, find the Claude logo, and click it.&lt;/p&gt;

&lt;p&gt;Skills are actually a form of plugin. We use the anthropics/skills marketplace: install Skills through plugins from the marketplace, and Claude will automatically load them when needed.&lt;/p&gt;

&lt;p&gt;Add the Skills plugin marketplace&lt;br&gt;
Enter /plugin and choose to add a plugin marketplace, then enter the official GitHub Skills address:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/anthropics/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install Skills plugin&lt;br&gt;
After adding the market, you will be prompted to install skill plugins:&lt;/p&gt;

&lt;p&gt;You can also quickly install Skills using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;/plugin install document-skills@anthropic-agent-skills
/plugin install example-skills@anthropic-agent-skills
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The official uses of the two skill plugins are as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;document-skills&lt;/strong&gt;: A package of document skills that can handle documents such as Excel, Word, PPT, and PDF.&lt;br&gt;
&lt;strong&gt;example-skills&lt;/strong&gt;: Sample skill sets that can handle skill creation, MCP building, visual design, algorithmic art, web testing, Slack GIF creation, theme styling, and more.&lt;br&gt;
Once installation succeeds, you can view the added skill plugins and the marketplace by entering the /plugin command and selecting the marketplace.&lt;/p&gt;

&lt;p&gt;You can also manage a skill plugin via the /plugin command’s Manage plugins option to perform operations such as updating and deleting.&lt;/p&gt;

&lt;p&gt;After installation, we're going to check if the skill-creator is available. I am going to ask Claude Code:&lt;/p&gt;

&lt;p&gt;Do you have the Skill Creator skill, and what does it do?&lt;br&gt;
You can see right here that we do have it, so I will switch to plan mode and ask it to build us a new skill.&lt;/p&gt;

&lt;p&gt;I want you to create a skill that helps me plan a complete one-month app launch. I need it to break down the launch into manageable weekly chunks: the first two weeks for getting everything ready (finishing features, creating app store materials, setting up marketing), the third week for the actual launch (testing with a small group first, reaching out to press, going live), and the final week for monitoring how it's doing and making quick fixes. Include some templates I can actually use, like launch checklists and social media posts. The skill should activate whenever I mention things like "app launch plan" or "launch my app in 30 days." It should work whether I'm launching an iPhone app,&lt;/p&gt;

&lt;p&gt;The main text should only include things that Claude doesn’t know.&lt;br&gt;
The Skill-Creator guide has this to say:&lt;/p&gt;

&lt;p&gt;Default assumption: Claude is already very smart. Only&lt;br&gt;
add context Claude doesn’t already have.&lt;/p&gt;

&lt;p&gt;The basic premise is that Claude is intelligent to begin with, so writing general knowledge or programming basics in SKILL.md will only waste tokens.&lt;/p&gt;

&lt;p&gt;You should focus on information you wouldn’t know (company-specific rules, quirks of specific libraries, domain-specific workflows, etc.).&lt;/p&gt;

&lt;p&gt;It is recommended to avoid lengthy explanations and use an imperative and concise writing style.&lt;/p&gt;
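&lt;p&gt;A made-up before-and-after shows the difference in style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;Verbose: "It would generally be advisable to consider validating the
input dates before generating the report, where appropriate."

Imperative: "Validate input dates first. Reject anything outside the
current quarter."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;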

&lt;p&gt;Match the “degrees of freedom” of instructions to the task&lt;br&gt;
It’s not necessary to specify everything in great detail; the key is to adjust the granularity of your instructions to suit the task.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High freedom (text-based instructions): when multiple approaches are effective, such as writing tasks&lt;/li&gt;
&lt;li&gt;Moderate freedom (pseudocode or scripts with parameters): there is a recommended pattern, but some variation is OK&lt;/li&gt;
&lt;li&gt;Low freedom (specific scripts, few parameters): when consistency of procedure is crucial and mistakes are fatal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What types of Claude Skills are there, and where can I find them?&lt;br&gt;
In terms of usage, there are two types: Claude currently supports the official built-in Skills and locally uploaded Skills.&lt;/p&gt;

&lt;p&gt;Based on the source of the skill, it can be divided into three types:&lt;/p&gt;

&lt;p&gt;Official Skills, provided by Anthropic and its partners.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://github.com/anthropics/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;For example, the logic behind those smooth features you use in the Claude.ai web version, such as "develop a web application for me," "analyse this PDF document," and "write a Snake game and preview it," is all in this repository!&lt;/p&gt;

&lt;p&gt;Custom Skills are ones you create yourself, and they suit users who need personalised customisation. Use Skill Creator to create and upload Skill files.&lt;/p&gt;

&lt;p&gt;Community skills, shared by other users, are readily available and much faster than reinventing the wheel, making them ideal for skill selection and modification. Simply download and upload; however, be aware of security risks before use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://skillsmp.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.aitmpl.com/skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;How do you determine if a task is suitable to be made into a Skill?&lt;br&gt;
When you find yourself frequently requesting the same type of tasks from Claude, or have templates or assets that need to be used repeatedly, such as:&lt;/p&gt;

&lt;p&gt;“Help me write the weekly report using the company’s template”: You need to write a team weekly report every week, and each time you need to tell Claude to organize the content according to three parts: “This week’s achievements, difficulties encountered, and next steps.” At this point, you can create a “Team Weekly Report Generator” skill.&lt;/p&gt;
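&lt;p&gt;A hypothetical front matter for that weekly-report skill (the name and phrasing are mine) could be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;---
name: team-weekly-report
description: Generates the team weekly report with the sections "This
  week's achievements", "Difficulties encountered", and "Next steps".
  Use when the user asks for a weekly report in the company template.
---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;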

&lt;p&gt;“Create presentations in our company’s style”: Often, it’s essential to strictly adhere to brand guidelines, including logo usage, brand colors, company name, company business content, and professional expectations. You can package these guidelines into a “Brand Presentation Style” skill.&lt;/p&gt;

&lt;p&gt;“Organizing market analysis reports/conducting competitor research using a specific format”: For example, creating a market analysis report might require combining three sets of competitor data, one set of internal sales data, and applying a fixed analytical framework. This entire complex process can be encapsulated into a “market analysis report” skill.&lt;br&gt;
Conversely, if it’s just an occasional, one-off request, you can simply state it in the chat, and there’s no need to create a Skill.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Claude Skills is an absolute must-have for people who frequently perform repetitive, routine tasks. It transforms your “unclear work experience” into “explicit rules” that AI can understand, allowing Anthropic’s tools to be perfectly adapted to your needs.&lt;/p&gt;

&lt;p&gt;Whether you’re a product owner, project manager, copywriter, or anyone using Claude in the workplace, you can rely on it to reduce repetitive work and ensure consistent output — that’s the core value of Skills.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The New Nano Banana 2 + OCR + Claude Code = Powerful AI OCR PDF Editor</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Sun, 08 Mar 2026 18:18:19 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/the-new-nano-banana-2-ocr-claude-code-powerful-ai-ocr-pdf-editor-1ha1</link>
      <guid>https://dev.to/gaodalie_ai/the-new-nano-banana-2-ocr-claude-code-powerful-ai-ocr-pdf-editor-1ha1</guid>
      <description>&lt;p&gt;Yesterday, when I was trying to draw an illustration that I usually insert into my note articles, I suddenly came across the words "Nano Banana 2." Huh? Wasn't it called Nano Banana Pro? Suddenly it becomes "2". Why? Since when?&lt;/p&gt;

&lt;p&gt;Upon further investigation, I discovered that on February 26, 2026, Google suddenly announced its latest image-generation AI model. This model was announced in a surprise move by Google. I only noticed it the next day, and was blown away by how quickly it was released…!&lt;/p&gt;

&lt;p&gt;That's Nano Banana 2. I tried it out right away and was simply blown away by the speed of its generation and the degree of evolution. The Nano Banana Pro I've been using until now is good, but the "2" isn't bad either.&lt;/p&gt;

&lt;p&gt;The cost of generating each image has been significantly reduced to about half that of Nano Banana Pro, and resolutions up to 4K are supported. There have also been improvements in practical aspects, including more accurate text rendering and greater character consistency.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of the live chatbot to show you how everything works.&lt;/p&gt;

&lt;p&gt;Check &lt;a href="https://www.youtube.com/watch?v=oq6yew2Rl3w&amp;amp;t=251s" rel="noopener noreferrer"&gt;Video&lt;/a&gt; :&lt;/p&gt;

&lt;p&gt;I'll start with the sidebar. There are three main settings. First, Resolution controls the size of the generated image. Higher resolution gives you better quality, but it also makes the API calls slower and more expensive. &lt;/p&gt;

&lt;p&gt;Second, Text Context decides whether the full extracted text of the PDF gets added to the prompt. When this option is on, the model can read the entire document and better understand the content before making edits.&lt;/p&gt;

&lt;p&gt;In Edit Mode, you choose the pages you want to change and write a prompt for each page. You can add as many page–prompt pairs as you want. If you add the same page more than once, the agent automatically merges the prompts into a single instruction.&lt;/p&gt;
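&lt;p&gt;The prompt-merging step can be sketched in plain Python (this is a guess at the logic, not the app's actual code):&lt;br&gt;
&lt;/p&gt;

```python
def merge_page_prompts(pairs):
    """Collapse duplicate page entries into one instruction per page.

    pairs: list of (page_number, prompt) tuples in the order the user
    added them. Returns a dict mapping page number -> merged prompt.
    """
    merged = {}
    for page, prompt in pairs:
        if page in merged:
            # Same page requested twice: join the prompts into a single instruction.
            merged[page] = merged[page] + " Also: " + prompt
        else:
            merged[page] = prompt
    return merged

edits = [(1, "Fix the title typo"),
         (3, "Recolor the chart"),
         (3, "Enlarge the logo")]
print(merge_page_prompts(edits))
# → {1: 'Fix the title typo', 3: 'Recolor the chart Also: Enlarge the logo'}
```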

&lt;p&gt;You can also select style reference pages before running the edits. These are pages from the same PDF that Gemini uses as a visual guide. This helps the edited slides keep the same fonts, colors, and layout as the rest of the document.&lt;/p&gt;

&lt;p&gt;When you click Run, the agent converts each selected page into a high-resolution image using a tool called Poppler. Then it sends all page edits to Google Gemini at the same time in parallel. That means editing five pages usually takes about the same time as editing just one.&lt;/p&gt;

&lt;p&gt;Gemini receives the page image, the style reference images, your prompt, and optionally the full document text. It processes all of this information and generates a new image of the slide with your requested changes. Sometimes it also returns a short note explaining what it modified.&lt;/p&gt;

&lt;p&gt;Once Gemini returns the updated image, the agent runs Tesseract OCR. Tesseract scans the image and embeds a hidden text layer behind it. This turns the image back into a real PDF page, so you can still search, highlight, and copy text from it.&lt;/p&gt;

&lt;p&gt;As each page finishes, the agent shows a side-by-side preview in the UI. You can immediately compare the original page with the edited version and see exactly what changed before downloading anything.&lt;/p&gt;

&lt;p&gt;After all pages are processed, the agent rebuilds the full PDF. It goes through every page of the original document and replaces only the edited ones. Each replacement keeps the same dimensions as the original page, so the layout stays perfectly aligned.&lt;/p&gt;

&lt;p&gt;In Add Mode, instead of editing a page, you create a brand-new slide. You choose where to insert it and describe what you want it to look like. The system then generates the slide from scratch using your style references as a visual guide. If you don't select any style references, the system automatically uses page 1 of the document.&lt;/p&gt;

&lt;p&gt;The generated slide follows the same workflow. Tesseract adds a searchable text layer, the agent inserts the slide into the correct position in the PDF, and you get a preview before downloading.&lt;br&gt;
This code will be available on my Patreon because it took me a lot of time and effort. If you enjoy what I create and want to see more projects like this, supporting me on Patreon helps me keep making high-quality content. I would truly appreciate your support&lt;/p&gt;
&lt;h1&gt;
  
  
  Why pair Claude with Nano Banana?
&lt;/h1&gt;

&lt;p&gt;Claude is an excellent text and code generation AI, but it cannot generate images by itself. On the other hand, Nano Banana is good at image generation but has limitations in managing complex contexts and iterative improvement instructions.&lt;/p&gt;

&lt;p&gt;Combining the two:&lt;/p&gt;

&lt;p&gt;Claude understands your intent and generates the optimal prompt → Nano Banana outputs an image&lt;br&gt;
Claude evaluates the generated results and identifies problems → Autonomously regenerates and corrects them&lt;br&gt;
Claude maintains context during long sessions → maintains consistent workflow&lt;/p&gt;

&lt;p&gt;In fact, when developers tried it, they were able to complete a project that involved repeatedly generating over 100 app icons for around $45.&lt;/p&gt;
&lt;h1&gt;
  
  
  How Nano Banana 2 works
&lt;/h1&gt;

&lt;p&gt;Nano Banana 2 uses a Multimodal Diffusion Transformer (MMDiT) architecture with a parameter scale of approximately 1.8 billion (1.8B) and Dynamic Quantisation-Aware Training (DQAT) to minimise memory footprint while maintaining high output quality.&lt;/p&gt;

&lt;p&gt;Grouped-Query Attention (GQA) is introduced to speed up inference.&lt;br&gt;
GQA is a technology that significantly reduces the amount of data movement during inference by sharing key-value pairs across groups. This allows it to run continuously without thermal throttling, even on the NPU of a mobile device.&lt;/p&gt;

&lt;p&gt;Furthermore, instead of the simple pattern matching of the original Nano Banana, the new model uses a multi-stage loop of "Plan → Evaluate → Improve." First, it analyses the prompt's intent and creates a generation plan. &lt;/p&gt;

&lt;p&gt;Next, it performs character-by-character verification of the text and checks the consistency of spatial placement. If there are any problems, they are improved before proceeding to finalize the pixels. &lt;br&gt;
This loop enables complex multi-object scenes and accurate text rendering.&lt;/p&gt;
&lt;h1&gt;
  
  
  What has changed the most? Three points
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;The biggest improvement is the ability to browse web information in real time. Gemini performs web searches and generates real-time information and images while adding a new feature called "World Knowledge." &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This feature, not available in Nano Banana Pro, allows for more accurate depictions of real-world places, people, and products. It seems to work particularly well with infographics and illustrations.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Further improvements in text rendering: Text rendering was already quite good in Nano Banana Pro, but Nano Banana 2 introduces a new system that verifies each character in a three-step loop: "Plan → Evaluate → Improve." &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even when Chinese and numbers are mixed, the text no longer breaks down, and the improvements are noticeable when you try it out.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;4K Support Exceeds the Pro Limit. While the Nano Banana Pro's maximum resolution was 2K, the Nano Banana 2 now supports 4K. 
The number of aspect ratios has also increased to 14 (including 9:16 and 21:9), making it suitable for everything from social media posts to cinematic banners.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🤔 So which one should I use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The conclusion of the usage seems to be something like this.&lt;br&gt;
When to use Nano Banana 2&lt;/p&gt;

&lt;p&gt;I want to create AI illustrations for posting on SNS and notes.&lt;br&gt;
I want to generate high-quality images for free.&lt;br&gt;
4K resolution required&lt;br&gt;
I want to create accurate illustrations and infographics that reference web information.&lt;br&gt;
I want to generate a large amount of data quickly.&lt;/p&gt;

&lt;p&gt;Situations where Nano Banana Pro is recommended&lt;br&gt;
Highest quality photorealism required&lt;br&gt;
Complex commercial creative production&lt;br&gt;
Tasks that require professional-level precision&lt;/p&gt;

&lt;p&gt;It seems like the Nano Banana 2 will be able to handle most of my everyday creative and AI illustration needs. I think the Pro is more of a trump card for when I really need it!&lt;/p&gt;
&lt;h1&gt;
  
  
  Let's start coding:
&lt;/h1&gt;

&lt;p&gt;I create an extract_full_text function that reads a PDF file and pulls out all the text inside it. First, it runs a fast external tool that converts the PDF into plain text while keeping the page layout as close as possible to the original slides. &lt;br&gt;
After that, the text is split into separate pages using a special page-break marker. The function then goes through each page one by one and skips any pages that are empty. &lt;/p&gt;

&lt;p&gt;Next, it cleans the text by removing extra spaces at the beginning and end. If a page has more than 2000 characters, the text is cut down and marked as truncated so it stays shorter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def extract_full_text(pdf_path: str) -&amp;gt; str:
    """Extracts the full text from a PDF using pdftotext (via subprocess for speed/layout)."""
    try:
        # Using -layout to preserve some spatial structure which is good for slides
        result = subprocess.run(
            ['pdftotext', '-layout', pdf_path, '-'],
            capture_output=True,
            text=True,
            check=True
        )
        raw_text = result.stdout

        # Split by form feed to get pages
        pages = raw_text.split('\f')

        formatted_pages = []
        for i, page_text in enumerate(pages):
            # Skip empty pages at the end if any
            if not page_text.strip():
                continue

            # Strip whitespace
            clean_text = page_text.strip()

            # Truncate to 2000 chars
            if len(clean_text) &amp;gt; 2000:
                clean_text = clean_text[:2000] + "...[truncated]"

            # Wrap in page tags (1-indexed)
            formatted_pages.append(f"&amp;lt;page-{i+1}&amp;gt;\n{clean_text}\n&amp;lt;/page-{i+1}&amp;gt;")

        return "&amp;lt;document_context&amp;gt;\n" + "\n".join(formatted_pages) + "\n&amp;lt;/document_context&amp;gt;"
    except subprocess.CalledProcessError as e:
        print(f"Error extracting text: {e}")
        return ""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
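&lt;p&gt;The page-splitting and truncation logic above can be exercised on its own, without pdftotext. This is a minimal sketch using a plain string, with simplified "[page N]" markers standing in for the tag wrapper used in the real function:&lt;/p&gt;

```python
# Exercise the page-splitting and truncation logic on a plain string.
# The "\f" (form feed) character is the page-break marker pdftotext emits.
raw_text = "Page one " + "x" * 2500 + "\fPage two\f"

formatted_pages = []
for i, page_text in enumerate(raw_text.split("\f")):
    if not page_text.strip():
        continue  # skip empty trailing pages
    clean_text = page_text.strip()
    if len(clean_text) > 2000:
        clean_text = clean_text[:2000] + "...[truncated]"
    formatted_pages.append(f"[page {i + 1}]\n{clean_text}")
```

&lt;p&gt;The first page is longer than 2000 characters, so it is cut and flagged; the empty piece after the final form feed is dropped.&lt;/p&gt;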



&lt;p&gt;After that, I made a function that converts an image into a single-page PDF. It uses an OCR tool to read the text in the image and creates a PDF that includes a hidden text layer. This hidden text makes the PDF searchable and easier to process later. It then saves the generated PDF to the location you provide.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def rehydrate_image_to_pdf(image: Image.Image, output_pdf_path: str):
    """
    Converts an image to a single-page PDF with a hidden text layer using Tesseract.
    This is the 'State Preservation' step.
    """
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, extension='pdf')
    with open(output_pdf_path, 'wb') as f:
        f.write(pdf_bytes)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I create a function that replaces specific pages in a PDF while keeping the rest of the document the same. First, it opens the original PDF and prepares a new file where the final version will be saved. Then it goes through each page in the document one by one. &lt;/p&gt;

&lt;p&gt;If a page number appears in the replacement list, the function loads the new page that should replace it. It checks the size of the original page and resizes the new page so both pages match in width and height. &lt;/p&gt;

&lt;p&gt;After that, the new page is added to the output document instead of the old one. If the page does not need replacement, the original page is simply copied to the new file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def batch_replace_pages(original_pdf_path: str, replacements: dict[int, str], output_pdf_path: str):
    """
    Replaces multiple pages in the original PDF.
    replacements: dict mapping page_number (1-indexed) -&amp;gt; path_to_new_single_page_pdf
    """
    reader = PdfReader(original_pdf_path)
    writer = PdfWriter()

    for i in range(len(reader.pages)):
        page_num = i + 1
        if page_num in replacements:
            # This page needs replacement
            original_page = reader.pages[i]
            original_width = original_page.mediabox.width
            original_height = original_page.mediabox.height

            new_pdf_path = replacements[page_num]
            new_reader = PdfReader(new_pdf_path)
            new_page = new_reader.pages[0]

            # Resize new page to match original dimensions
            new_page.scale_to(width=float(original_width), height=float(original_height))

            writer.add_page(new_page)
        else:
            # Keep original page
            writer.add_page(reader.pages[i])

    with open(output_pdf_path, 'wb') as f:
        writer.write(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
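&lt;p&gt;The page-selection logic of the loop above can be sketched without pypdf, using plain values in place of page objects. The helper name and the "orig-N" placeholders are invented for illustration:&lt;/p&gt;

```python
def choose_pages(num_pages, replacements):
    # Same selection rule as batch_replace_pages: replacements maps
    # 1-indexed page numbers to their new content; all other pages are kept.
    return [replacements.get(i + 1, f"orig-{i + 1}") for i in range(num_pages)]
```

&lt;p&gt;For a four-page document with a replacement for page 2, every page except the second is passed through unchanged.&lt;/p&gt;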



&lt;p&gt;Next, I made a function that adds a new page into an existing PDF at a specific position. First, it opens the original PDF and prepares a new document where the final version will be saved. &lt;/p&gt;

&lt;p&gt;It then checks the size of the first page so the new page can match the same width and height. After that, the function loads the new page and resizes it to match the document's page size. If the position is set to 0, the new page is inserted at the beginning of the document. &lt;/p&gt;

&lt;p&gt;Otherwise, the function goes through each page and inserts the new page right after the chosen page number. Finally, the updated PDF with the inserted page is saved to the output file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def insert_page(original_pdf_path: str, new_page_pdf_path: str, after_page: int, output_pdf_path: str):
    """
    Inserts a new page into the PDF after the specified page number.
    after_page: 0 to insert at the beginning, or page number (1-indexed) to insert after.
    """
    reader = PdfReader(original_pdf_path)
    writer = PdfWriter()

    # Get dimensions from the first page as reference
    reference_page = reader.pages[0]
    ref_width = reference_page.mediabox.width
    ref_height = reference_page.mediabox.height

    # Load the new page
    new_reader = PdfReader(new_page_pdf_path)
    new_page = new_reader.pages[0]
    new_page.scale_to(width=float(ref_width), height=float(ref_height))

    # Insert at beginning
    if after_page == 0:
        writer.add_page(new_page)

    # Add all original pages, inserting the new one at the right position
    for i in range(len(reader.pages)):
        writer.add_page(reader.pages[i])
        # Insert after this page if it matches
        if i + 1 == after_page:
            writer.add_page(new_page)

    with open(output_pdf_path, 'wb') as f:
        writer.write(f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
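&lt;p&gt;The ordering logic of insert_page can likewise be checked on plain lists, independent of pypdf (the helper name is invented for illustration):&lt;/p&gt;

```python
def insert_positions(pages, new_page, after_page):
    # Same ordering rule as insert_page above, applied to a plain list.
    if after_page == 0:
        return [new_page] + pages  # insert at the very beginning
    out = []
    for i, page in enumerate(pages):
        out.append(page)
        if i + 1 == after_page:  # insert right after the chosen page
            out.append(new_page)
    return out
```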



&lt;p&gt;Finally, I made a function to generate a new slide image using a user prompt and optional style references. It sends the instructions to an AI model, which creates the image and optional text. The function then extracts the generated image and text and returns them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_new_slide(
    style_reference_images: List[Image.Image],
    user_prompt: str,
    full_text_context: str = "",
    resolution: str = "4K",
    enable_search: bool = False
) -&amp;gt; Tuple[Image.Image, Optional[str]]:
    """
    Generates a completely new slide based on style references and a prompt.
    Returns a tuple of (generated PIL Image, optional text response).
    """
    client = get_client()

    # Construct the prompt
    prompt_parts = []

    prompt_parts.append(user_prompt)

    if style_reference_images:
        prompt_parts.append("Match the visual style (fonts, colors, layout) of these reference images:")
        for img in style_reference_images:
            prompt_parts.append(img)

    if full_text_context:
        prompt_parts.append(f"DOCUMENT CONTEXT:\n{full_text_context}\n")

    # Build config - allow both text and image output
    config = types.GenerateContentConfig(
        response_modalities=['TEXT', 'IMAGE'],
        image_config=types.ImageConfig(
            image_size=resolution
        )
    )
    if enable_search:
        config.tools = [{"google_search": {}}]

    # Call the model
    try:
        response = client.models.generate_content(
            model='gemini-3-pro-image-preview',
            contents=prompt_parts,
            config=config
        )
    except Exception as e:
        error_msg = str(e).lower()
        if "quota" in error_msg or "billing" in error_msg or "payment" in error_msg:
            raise RuntimeError(
                "Gemini API Error: This tool requires a PAID API key with billing enabled.\n"
                "Free tier keys do not support image generation. Please:\n"
                "1. Visit https://aistudio.google.com/api-keys\n"
                "2. Enable billing on your Google Cloud project\n"
                f"Original error: {e}"
            )
        elif "api key" in error_msg or "authentication" in error_msg or "unauthorized" in error_msg:
            raise RuntimeError(
                "Gemini API Error: Invalid API key.\n"
                "Please check that your GEMINI_API_KEY environment variable is set correctly.\n"
                f"Original error: {e}"
            )
        else:
            raise RuntimeError(f"Gemini API Error: {e}")

    # Extract image and text from the response
    generated_image = None
    response_text = None
    if response.candidates and response.candidates[0].content.parts:
        for part in response.candidates[0].content.parts:
            if part.inline_data:
                # Convert bytes to PIL Image
                from io import BytesIO
                generated_image = Image.open(BytesIO(part.inline_data.data))
            elif part.text:
                response_text = part.text

    if not generated_image:
        raise RuntimeError("No image generated by the model.")

    return generated_image, response_text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
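&lt;p&gt;The prompt-assembly step at the top of generate_new_slide can be isolated into a small helper for testing. This is a sketch that mirrors the interleaving above, with strings standing in for PIL images; build_prompt_parts is a hypothetical name, not part of the original code:&lt;/p&gt;

```python
def build_prompt_parts(user_prompt, style_reference_images, full_text_context=""):
    # Mirrors the prompt assembly in generate_new_slide:
    # user prompt first, then style instruction + reference images, then context.
    parts = [user_prompt]
    if style_reference_images:
        parts.append("Match the visual style (fonts, colors, layout) of these reference images:")
        parts.extend(style_reference_images)
    if full_text_context:
        parts.append(f"DOCUMENT CONTEXT:\n{full_text_context}\n")
    return parts
```

&lt;p&gt;Keeping the user prompt first and the document context last makes it easy to see what the model receives, and optional sections simply drop out when empty.&lt;/p&gt;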



&lt;p&gt;My impressions after a morning of use&lt;/p&gt;

&lt;p&gt;To be honest, I thought the Pro would be enough, but when I actually tried them side by side, the difference was greater than I expected. &lt;/p&gt;

&lt;p&gt;In particular, 4K support and web information reference are strengths unique to Nano Banana 2, and I think they are improvements not found in Pro. Also, text comprehension has improved. &lt;/p&gt;

&lt;p&gt;In fact, today’s thumbnail and illustration were both created with Nano Banana 2, each generated in one go.&lt;/p&gt;

&lt;p&gt;The fact that the cost has been halved is also a welcome update, especially for AI generation users who like to experiment a lot. &lt;br&gt;
Gemini has an advantage in terms of generation speed compared to Midjourney, and the fact that it's free to use is one of its strengths.&lt;/p&gt;

&lt;p&gt;I would highly appreciate your support:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;br&gt;
Support the Content (every Dollar goes back into the video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;br&gt;
Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>RLM: The Ultimate Evolution of AI? Recursive Language Models</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Tue, 13 Jan 2026 17:52:19 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/rlm-the-ultimate-evolution-of-ai-recursive-language-models-3h8o</link>
      <guid>https://dev.to/gaodalie_ai/rlm-the-ultimate-evolution-of-ai-recursive-language-models-3h8o</guid>
      <description>&lt;p&gt;During the weekend, I scrolled through Twitter to see what was happening in the AI community. MIT has just released a groundbreaking paper that addresses a significant issue with large language models.&lt;/p&gt;

&lt;p&gt;It sounds very academic, but here’s the simple version: essentially, if you have AI act a second time, the results can be remarkable.&lt;/p&gt;

&lt;p&gt;Over the past two years, almost all mainstream large-scale models have been racing to expand their context windows. Gemini has increased its window size to the millions, the GPT series continues to increase its investment, and Llama has even proclaimed a goal of tens of millions of tokens.&lt;/p&gt;

&lt;p&gt;On the surface, this is an arms race of “who can fill the most space.” But the problem is that increasing the context window does not mean that the model can actually “read in and remember” all the content.&lt;/p&gt;

&lt;p&gt;Another popular approach is Retrieval-Augmented Generation (RAG), which first segments long documents into chunks and stores them in a vector database, then retrieves relevant segments based on the question and feeds them to the model.&lt;/p&gt;

&lt;p&gt;This avoids having the model consume the entire long document at once, but its effectiveness is highly dependent on the quality of the retrieval, and it often struggles with questions that require comprehensive information from the entire text.&lt;/p&gt;

&lt;p&gt;However, these methods all share a common problem: they assume that the model is passive. The model can only wait for humans to organize, segment, and feed it information. True intelligence shouldn’t be like this.&lt;/p&gt;

&lt;p&gt;MIT has proposed a disruptive idea: why not let the model read by itself? Search by itself? Slice by itself? Call itself?&lt;/p&gt;

&lt;p&gt;Thus, Recursive Language Models (RLM) were born.&lt;/p&gt;

&lt;p&gt;RLM’s core insight is very simple, yet revolutionary: it transforms the context from “input” to “environment”.&lt;/p&gt;

&lt;p&gt;The model no longer receives a long string of tokens, but instead, like a program, treats the entire context as a variable within a REPL (Read-Eval-Print Loop) environment, allowing it to view, slice, search, filter, and recursively call itself at any time. It is no longer “fed information,” but rather “actively explores information.”&lt;/p&gt;

&lt;p&gt;It’s like going from “Here’s a book for you to read” to “Here’s a library for you to search, dissect, summarise, and use your own assistants.”&lt;/p&gt;

&lt;p&gt;This not only bypasses the context constraints of Transformer, but also gives the model the ability to “procedurally access the world” for the first time.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=JF13pSE0KLA&amp;amp;t=1s" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're going to ask a question: “Print me the first 100 powers of two, each on a newline”&lt;/p&gt;

&lt;p&gt;If you see how the chatbot generates output, you’ll see that the agent processes the full input, which can be millions of tokens, loaded into a Python REPL environment as a variable; the agent does not read this text directly. Instead, it treats the input as an environment it can operate on.&lt;/p&gt;

&lt;p&gt;First, the model performs exploration and inspection. It prints small slices of the context, checks structure, looks for headers, patterns, or repeated phrases, and uses tools like string slicing and regular expressions to understand how the data is organised. This step replaces passive reading with active scanning.&lt;/p&gt;

&lt;p&gt;Next, the model applies programmatic filtering and indexing. Using Python methods such as split(), find(), re.findall(), loops, and conditionals, it narrows the massive input down to only the parts that matter for the task. Noise is discarded early, which prevents context overload.&lt;/p&gt;

&lt;p&gt;Once relevant sections are identified, the model performs task decomposition. It breaks the main problem into smaller, well-defined subtasks. Each subtask fits comfortably within a normal model context window. Humans do not predefine this decomposition — the model decides how to split the problem based on what it discovers during exploration.&lt;/p&gt;

&lt;p&gt;Then comes the key step: recursive self-calls. For each subtask, the model calls itself (or a smaller helper model) to process that chunk. These calls form a tree of reasoning, not a single chain. Each call returns a partial result, which is stored in variables inside the REPL environment.&lt;/p&gt;

&lt;p&gt;After sub-results are collected, the model performs aggregation and synthesis. It uses Python logic to combine summaries, compare results, compute pairwise relationships, or assemble structured outputs like lists, tables, or long documents.&lt;/p&gt;

&lt;p&gt;The model then applies verification and self-checking. It may re-run parts of the analysis, cross-check results with another recursive call, or validate logic using code. This creates multi-pass reasoning similar to human double-checking.&lt;/p&gt;

&lt;p&gt;Finally, the model constructs the final output. Instead of being limited by token output size, it builds the answer piece by piece in variables and then returns the assembled result. This allows extremely long, structured outputs that traditional LLMs cannot produce.&lt;/p&gt;
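&lt;p&gt;The workflow above (decompose, recursive self-calls, aggregation) can be sketched in a few lines of Python. This is a toy illustration, not the paper’s implementation; llm_query and rlm_query are stubs invented here:&lt;/p&gt;

```python
def llm_query(prompt: str) -> str:
    # Stub standing in for a real model call; returns a fake summary.
    return f"summary({len(prompt)} chars)"

def rlm_query(context: str, chunk_size: int = 1000) -> str:
    # Decompose: split the huge context into window-sized chunks.
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Recursive self-calls: one sub-call per chunk; partial results are
    # held in ordinary variables inside the REPL environment.
    partials = [llm_query(chunk) for chunk in chunks]
    # Aggregation: synthesize the partial results with one final call.
    return llm_query("\n".join(partials))
```

&lt;p&gt;Each sub-call sees only a chunk that fits a normal context window, and the final call sees only the short partial summaries, not the raw context.&lt;/p&gt;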

&lt;h1&gt;
  
  
  What makes RLM special?
&lt;/h1&gt;


&lt;p&gt;Recursive Language Models (RLMs) are special because they change an AI from a passive reader into an active problem-solver. Instead of trying to understand a huge input all at once, an RLM treats the input like a workspace it can explore, search, and break apart using code.&lt;/p&gt;

&lt;p&gt;It decides what to read, how to slice the information, and when to call itself again to solve smaller pieces. By using programmatic access, recursion, and self-checking, it avoids getting confused by long or complex inputs and stays stable even as tasks grow harder.&lt;/p&gt;

&lt;p&gt;This lets RLM handle massive contexts, high-complexity reasoning, and long structured outputs in a way traditional language models simply can’t.&lt;/p&gt;

&lt;h1&gt;
  
  
  How exactly does RLM work?
&lt;/h1&gt;


&lt;p&gt;Traditional LLMs work simply: you feed in a long string of tokens, and it gives you an answer in a single forward inference.&lt;/p&gt;

&lt;p&gt;But when the context length exceeds hundreds of thousands or millions, this approach is like asking someone to read “War and Peace” in one go before answering a question — it’s bound to break down.&lt;/p&gt;

&lt;p&gt;RLM’s approach is completely different.&lt;/p&gt;

&lt;p&gt;It loads the entire long context into a Python REPL environment as a variable, such as &lt;code&gt;context&lt;/code&gt;. The model no longer directly “eats” these tokens; instead, it accesses them by writing code, much like a programmer.&lt;/p&gt;

&lt;p&gt;This means that for the first time, the model has a “tool.” It can:&lt;/p&gt;

&lt;p&gt;View a specific segment: &lt;code&gt;print(context[:500])&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Search for a keyword: &lt;code&gt;re.findall("festival", context)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Split by chapter: &lt;code&gt;part1, part2 = context.split("Chapter 2")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Construct a subtask: &lt;code&gt;sub_answer = llm_query(f"Please summarize {part1}")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;It can even recursively call itself: &lt;code&gt;result = rlm_query(sub_prompt)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This is like giving the model “hands” and “eyes”. It is no longer a passive language generator, but an intelligent agent that can actively explore, actively deconstruct, and actively plan.&lt;/p&gt;
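&lt;p&gt;These operations can be tried directly in a REPL. Here is a runnable toy, with an invented two-chapter string standing in for a real long document:&lt;/p&gt;

```python
import re

# A toy stand-in for the huge `context` variable loaded into the REPL.
context = (
    "Chapter 1\nThe festival began at dawn.\n"
    "Chapter 2\nThe festival ended at night.\n"
)

print(context[:500])                        # view a specific segment
hits = re.findall("festival", context)      # search for a keyword
part1, part2 = context.split("Chapter 2")   # split by chapter
```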

&lt;p&gt;The examples in the study are very vivid. The model will first print the first 100 lines to check the structure before deciding how to slice them; it will use keywords to filter out potentially related paragraphs; it will break down the task into multiple sub-problems and then recursively call itself to solve them.&lt;/p&gt;

&lt;p&gt;This isn’t prompt engineering; it’s program engineering.&lt;/p&gt;

&lt;h1&gt;
  
  
  What’s the limitation of RLM?
&lt;/h1&gt;

&lt;p&gt;The main limitation of RLM is that its power comes with overhead and complexity. When the input is short and the task is simple, using the base model directly is often faster and more efficient, since RLM adds extra steps like environment interaction and recursive calls.&lt;/p&gt;

&lt;p&gt;In its current form, RLM relies on synchronous, blocking sub-model calls, which increases end-to-end latency and can slow down responses. The paper also notes that system prompts are fixed and not tailored to different task types, leaving performance gains on the table.&lt;/p&gt;

&lt;p&gt;Finally, letting the model write and execute code inside a REPL introduces real engineering challenges, especially around security isolation, safety, and predictable behavior.&lt;/p&gt;

&lt;p&gt;In short, RLM is powerful for hard, large-scale problems, but it is heavier, slower, and more complex than standard models for simple tasks.&lt;/p&gt;

&lt;h1&gt;
  
  
  My impression :
&lt;/h1&gt;

&lt;p&gt;RLM represents a shift from “how do we compress context?” to “how do we teach models to actively manage context like a skilled developer?”&lt;/p&gt;

&lt;p&gt;Instead of fighting context limits with bigger windows or lossy summaries, RLMs embrace the constraint and learn to work within it — delegating, filtering, and focusing programmatically. It’s scaffolding that scales with learning, not just engineering.&lt;/p&gt;

&lt;p&gt;I would highly appreciate your support:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Support the Content (every Dollar goes back into the video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>DSPy 3 + GEPA: The Most Advanced RAG Framework Yet — Auto Reasoning &amp; Prompting</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Fri, 26 Dec 2025 09:02:59 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/dspy-3-gepa-the-most-advanced-rag-framework-yet-auto-reasoning-prompting-32mi</link>
      <guid>https://dev.to/gaodalie_ai/dspy-3-gepa-the-most-advanced-rag-framework-yet-auto-reasoning-prompting-32mi</guid>
      <description>&lt;p&gt;Last week, OpenAI experienced a sudden surge in the middle of the night and went into a panic. GPT-5.2 has been released, and the global AI throne has changed hands once again.&lt;/p&gt;

&lt;p&gt;A major update after just about four months is unusual. The trigger was competitive pressure: Reuters reports that Altman issued a “code red” in early December to accelerate development, in response to Google’s Gemini 3.&lt;/p&gt;

&lt;p&gt;OpenAI itself also positions it this way: “Rather than new features, we have improved performance in areas such as intelligence, code processing, and long-form text comprehension, and it is particularly strong at creating spreadsheets, creating presentations, and other complex, multi-step tasks.”&lt;/p&gt;

&lt;p&gt;In other words, GPT-5.2 is not a “major update,” but rather a refined version that enhances reliability, long-term context, tool execution, and output generation for practical applications. It’s safe to say that it’s not a new toy, but rather a work tool that’s become easier to use.&lt;/p&gt;

&lt;p&gt;In recent years, “agentic AI” has come to perform complex series of actions, with the LLM invoking tools, making inferences, and finally providing a final answer. To optimise these actions, the standard approach has been to use reinforcement learning (RL) to “learn good actions from rewards.” But there are two problems.&lt;/p&gt;

&lt;p&gt;RL only provides a simple scalar reward, “whether the answer is correct or not,” making learning extremely inefficient.&lt;/p&gt;

&lt;p&gt;Additionally, fine-tuning a model requires extensive rollout and computational costs.&lt;/p&gt;

&lt;p&gt;Last year, I created a video about DSPy, and since then, it has made significant progress. At its core, DSPy treats language models as unique “devices,” similar to CPUs and GPUs used in deep learning.&lt;/p&gt;

&lt;p&gt;In DSPy, you only need to declare the required “Natural Language Signatures,” without worrying about the specific details of the Prompt implementation (in fact, after a year of practice, we found that worrying about those details is largely meaningless and doesn’t change the fact that LLM outputs are unstable).&lt;/p&gt;

&lt;p&gt;DSPy can be understood as follows: based on these signatures, DSPy can automatically generate, optimise, and fine-tune the Prompt, ultimately outputting results that meet expectations.&lt;/p&gt;

&lt;p&gt;GEPA’s idea: encouraging LLMs to “reflect on their own failures.” Instead of using reinforcement learning, GEPA (Genetic-Pareto Prompt Optimizer) takes an approach whereby the LLM itself analyzes its own behavior in natural language and suggests how to improve next time. In other words, instead of tweaking the model’s parameters, it reflects on and evolves the “prompt” itself.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Link to &lt;a href="https://www.youtube.com/watch?v=RyQNqzuAVcs" rel="noopener noreferrer"&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I prepare the SPACE_KNOWLEDGE corpus (this technique is an alternative way to optimise the model that outperforms reinforcement learning) and ask a question about space: “Which space telescope is most powerful?”&lt;/p&gt;

&lt;p&gt;If you watch how the chatbot generates the output, you’ll see that the agent uses Term Frequency–Inverse Document Frequency to weight each term (how often a word appears in a document, and how rare that word is across all documents), then uses cosine similarity to find which chunks are genuinely similar to your question rather than just having random word matches. The top three most relevant chunks are then retrieved.&lt;/p&gt;

&lt;p&gt;Next, the agent’s confidence-based RAG uses chain-of-thought to generate an answer plus a confidence level, so it can honestly tell you “I don’t have enough information” instead of hallucinating. The multi-hop RAG takes it further by first extracting bullet-pointed facts from the context, then synthesising those facts into a comprehensive answer. This two-step process is crucial for complex questions that require combining information from multiple sources, because it prevents the agent from getting confused or missing connections.&lt;/p&gt;

&lt;p&gt;Now here’s where GEPA comes in as a game-changer: instead of manually tweaking prompts or using older optimizers like MIPROv2, GEPA uses genetic algorithms. It combines good prompts to make better ones.&lt;/p&gt;

&lt;p&gt;It utilises Pareto optimisation to maintain multiple effective prompts, rather than just one. It also utilises reflection, which it learns from mistakes by reading text feedback and making corrections. Over time, this helps GEPA automatically generate increasingly better prompts.&lt;/p&gt;

&lt;p&gt;It builds a prompt evolution tree. Each new improvement grows like a branch on a tree. Every branch keeps what worked before and adds a few improvements. Step by step, the prompts get closer to the best instructions for the RAG task, and it does this 35 times more efficiently than MIPROv2 while generating prompts that are 9 times shorter yet perform 10% better.&lt;/p&gt;
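&lt;p&gt;The keep-what-works evolutionary loop can be illustrated with a toy sketch. This is not the actual GEPA implementation; the CANDIDATES list and the mutate/score stubs are invented stand-ins for LLM reflection and real task evaluation:&lt;/p&gt;

```python
# Toy sketch of GEPA-style prompt evolution: keep what worked, add improvements.
CANDIDATES = ["Cite sources.", "Answer step by step.", "Say 'unknown' if unsure."]

def mutate(prompt: str) -> str:
    # Reflection stand-in: propose adding the first instruction not yet present.
    for c in CANDIDATES:
        if c not in prompt:
            return prompt + " " + c
    return prompt

def score(prompt: str) -> int:
    # Fitness stand-in: reward prompts that include desired behaviours.
    return sum(c in prompt for c in CANDIDATES)

def evolve(seed: str, generations: int = 5) -> str:
    best = seed
    for _ in range(generations):
        child = mutate(best)
        if score(child) > score(best):  # selection: keep only improvements
            best = child
    return best
```

&lt;p&gt;Each generation keeps the surviving prompt and adds one improvement, so the prompt converges on the full instruction set, much like branches accumulating on the evolution tree.&lt;/p&gt;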

&lt;h1&gt;
  
  
  What makes GPT-5.2 stand out?
&lt;/h1&gt;

&lt;p&gt;Let’s start with the most shocking data. One of the tests used to measure AI performance is called “ARC-AGI-2.”&lt;/p&gt;

&lt;p&gt;This is a test that requires solving abstract puzzles on first sight (insight), and does not rely on “looking for answers in past data” (memorisation). In other words, it’s a test that measures “innate intelligence.” Take a look at the scores: GPT-5.1: 17.6%, Gemini 3 Pro: 31.1%, GPT-5.2: 52.9% (+35.3 points!)&lt;/p&gt;

&lt;p&gt;This increase is crazy. It’s more than triple the score of the previous version, 5.1. It’s nearly double the score of Gemini.&lt;/p&gt;

&lt;p&gt;If previous AIs were like “geniuses who memorised textbooks word for word,” then GPT-5.2 has evolved into “geniuses who can solve difficult problems they’ve never seen before with ingenuity.” The common AI phrase, “I can’t do it because I wasn’t taught,” is becoming a thing of the past.&lt;/p&gt;

&lt;p&gt;The next metric worth noting is “GDPval.” This test measures how well the model performs “real-world tasks” such as research, planning, and decision-making: GPT-5.1: 38.8%, Gemini 3 Pro: 53.5%, GPT-5.2: 70.9% (+32.1 points!)&lt;/p&gt;

&lt;p&gt;Again, the results are overwhelming. In 5.1, the AI was a “newbie intern waiting for instructions,” but in 5.2, it has been promoted to the “manager who makes plans and manages projects” class. Those who have complained that “AI is smart, but difficult to use at work” will be amazed by the “on-the-job capabilities” of 5.2.&lt;/p&gt;

&lt;h1&gt;
  
  
  What makes GEPA unique?
&lt;/h1&gt;

&lt;p&gt;The core concept of GEPA originates from the essence of human learning — reflection.&lt;/p&gt;

&lt;p&gt;It’s not just about adding more instructions, but rather, like an experienced mentor, it examines past attempts, analyzes successes and shortcomings, and then proposes better solutions.&lt;/p&gt;

&lt;p&gt;GEPA constructs a prompt evolution tree, allowing each optimization to grow like a branch, accumulating improvements and gradually approaching the optimal prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobijiul48leu33ywy3aa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobijiul48leu33ywy3aa.png" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike traditional reinforcement learning (RL), GEPA leverages the reflective capabilities of language models, combined with domain-specific textual feedback, rather than relying solely on a single scalar metric.&lt;/p&gt;

&lt;p&gt;This is akin to giving the model “X-ray vision,” enabling it to notice small details in the task and produce strong results in just a few steps.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s start coding:
&lt;/h1&gt;

&lt;p&gt;Let us now explore the process step by step and unravel how to use DSPy 3, the GEPA optimiser, and agentic RAG. First, we install the libraries that support the model from the requirements file.&lt;/p&gt;

&lt;p&gt;I would like to inform you that the code I shared here is only a part of my code. If you would like the full folder, you can find it on my Patreon. This code took me a considerable amount of time&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Term Frequency–Inverse Document Frequency
&lt;/h2&gt;

&lt;p&gt;So, I create a Term Frequency Inverse Document Frequency retriever to find the documents that best match a user’s question. First, it stores all the documents and breaks each one into simple lowercase words, removing punctuation so the text is clean and easy to compare.&lt;/p&gt;

&lt;p&gt;Next, it looks at all documents together and calculates how important each word is across the whole collection: words that appear in many documents become less important, while words that appear in only a few documents become more important.&lt;/p&gt;
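&lt;p&gt;That weighting is the IDF term. A tiny self-contained sketch (the toy documents are mine, for illustration) shows the effect: a word that appears in most documents gets a lower weight than a rare one.&lt;/p&gt;

```python
import math
from collections import Counter

# Toy corpus, already tokenized (illustrative only)
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["astronaut", "repairs", "station"]]

doc_count = len(docs)
# In how many documents does each term appear at least once?
term_doc_counts = Counter(t for d in docs for t in set(d))

# Same smoothed IDF formula the retriever below uses
idf = {t: math.log((doc_count + 1) / (c + 1)) + 1 for t, c in term_doc_counts.items()}

# "the" appears in 2 of 3 documents, "astronaut" in only 1
assert idf["astronaut"] > idf["the"]
```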

&lt;p&gt;When a query comes in, it is cleaned and broken into words the same way, and each word is given a score based on how often it appears and how rare it is overall.&lt;/p&gt;

&lt;p&gt;The retriever then compares the query to every document by measuring how similar their word scores are, using cosine similarity, a measure of how closely two vectors point in the same direction.&lt;/p&gt;

&lt;p&gt;Each document gets a similarity score, the documents are sorted from best match to worst, and finally, the top few most relevant documents are returned to the user.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TFIDFRetriever:
    """
    TF-IDF (Term Frequency - Inverse Document Frequency) retriever.

    This is smarter than simple keyword matching because:
    - TF: Words that appear often in a document are important for that document
    - IDF: Words that appear in many documents are less important overall

    Example: "the" appears everywhere (low IDF), but "astronaut" is specific (high IDF)
    """

    def __init__(self, documents: list[str], k: int = 3):
        self.documents = documents
        self.k = k
        self.doc_tokens = [self._tokenize(doc) for doc in documents]
        self.idf = self._compute_idf()

    def _tokenize(self, text: str) -&amp;gt; list[str]:
        """Convert text to lowercase tokens, removing punctuation."""
        import re
        text = text.lower()
        tokens = re.findall(r'\b[a-z]+\b', text)
        return tokens

    def _compute_idf(self) -&amp;gt; dict[str, float]:
        """Compute IDF for all terms in the corpus."""
        doc_count = len(self.documents)
        term_doc_counts = Counter()

        for tokens in self.doc_tokens:
            unique_tokens = set(tokens)
            for token in unique_tokens:
                term_doc_counts[token] += 1

        idf = {}
        for term, count in term_doc_counts.items():
            # Standard IDF formula with smoothing
            idf[term] = math.log((doc_count + 1) / (count + 1)) + 1

        return idf

    def _compute_tfidf(self, tokens: list[str]) -&amp;gt; dict[str, float]:
        """Compute TF-IDF vector for a list of tokens."""
        tf = Counter(tokens)
        tfidf = {}
        for term, count in tf.items():
            tfidf[term] = count * self.idf.get(term, 1.0)
        return tfidf

    def _cosine_similarity(self, vec1: dict, vec2: dict) -&amp;gt; float:
        """Compute cosine similarity between two sparse vectors."""
        common_terms = set(vec1.keys()) &amp;amp; set(vec2.keys())
        if not common_terms:
            return 0.0

        dot_product = sum(vec1[t] * vec2[t] for t in common_terms)
        norm1 = math.sqrt(sum(v ** 2 for v in vec1.values()))
        norm2 = math.sqrt(sum(v ** 2 for v in vec2.values()))

        if norm1 == 0 or norm2 == 0:
            return 0.0

        return dot_product / (norm1 * norm2)

    def __call__(self, query: str) -&amp;gt; list[str]:
        """Retrieve top-k documents most similar to the query."""
        query_tokens = self._tokenize(query)
        query_vec = self._compute_tfidf(query_tokens)

        scores = []
        for i, doc_tokens in enumerate(self.doc_tokens):
            doc_vec = self._compute_tfidf(doc_tokens)
            score = self._cosine_similarity(query_vec, doc_vec)
            scores.append((score, i, self.documents[i]))

        # Sort by score descending
        scores.sort(key=lambda x: x[0], reverse=True)

        return [doc for score, idx, doc in scores[:self.k]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Retrieval-Augmented Generation
&lt;/h2&gt;

&lt;p&gt;After that, I created two modules that answer questions using retrieval-augmented generation (RAG). In the first one, the agent takes a question, looks up the most relevant documents, joins them into one context, and then generates an answer while also reporting how confident it is.&lt;/p&gt;

&lt;p&gt;It saves the documents it used, so you can later see where the answer came from. The second system is made for harder questions that need more thinking.&lt;/p&gt;

&lt;p&gt;It first retrieves documents the same way, then pulls out only the important facts related to the question, and finally combines those facts to create a clear answer.&lt;/p&gt;

&lt;p&gt;It also keeps both the retrieved documents and the extracted facts, so you can inspect each step and understand how the final answer was built.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class RAGWithConfidence(dspy.Module):
    """RAG that reports its confidence in the answer."""

    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.ChainOfThought(AnswerWithConfidence)

    def forward(self, question: str):
        docs = self.retriever(question)
        context = "\n\n".join(docs)
        result = self.generate(context=context, question=question)
        result.retrieved_docs = docs
        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class MultiHopRAG(dspy.Module):
    """
    Multi-hop RAG: Extract facts first, then synthesize an answer.

    This helps with complex questions that require combining information
    from multiple sources.
    """

    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.extract = dspy.Predict(ExtractFacts)
        self.synthesize = dspy.Predict(SynthesizeAnswer)

    def forward(self, question: str):
        # Step 1: Retrieve
        docs = self.retriever(question)
        context = "\n\n".join(docs)

        # Step 2: Extract relevant facts
        extraction = self.extract(context=context, question=question)

        # Step 3: Synthesize answer from facts
        result = self.synthesize(facts=extraction.facts, question=question)

        # Attach intermediate results for inspection
        result.retrieved_docs = docs
        result.extracted_facts = extraction.facts

        return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Reflective Prompt Evolution
&lt;/h2&gt;

&lt;p&gt;Then I use GEPA to learn and improve answers step by step. First, the metric checks the model’s answer against the expected answer. If the answer matches exactly, it gives a full score.&lt;/p&gt;

&lt;p&gt;If the answer is only partly correct, it gives a lower score and explains what is missing. If the answer is wrong, it gives a low score and clear feedback about the mistake.&lt;/p&gt;
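&lt;p&gt;Stripped of the DSPy plumbing, those scoring tiers look like this (a standalone sketch; the function name is mine):&lt;/p&gt;

```python
def overlap_score(expected: str, actual: str) -> float:
    """Score an answer: 1.0 exact containment, 0.7 strong overlap, 0.3 weak, else 0.0."""
    expected, actual = expected.lower(), actual.lower()
    if expected in actual:
        return 1.0  # the expected answer appears verbatim
    expected_words = set(expected.split())
    actual_words = set(actual.split())
    overlap = (
        len(expected_words.intersection(actual_words)) / len(expected_words)
        if expected_words
        else 0.0
    )
    if overlap > 0.5:
        return 0.7  # partially correct
    if overlap > 0:
        return 0.3  # some relevant info, missing key details
    return 0.0      # incorrect
```

For example, an answer containing the expected string verbatim scores 1.0, while one sharing only a couple of words drops into the partial tiers.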

&lt;p&gt;This feedback is important because GEPA reads it and learns how to improve future prompts. The simple RAG module then works by taking a question, retrieving related documents, joining them into one context, and generating an answer from that context.&lt;/p&gt;

&lt;p&gt;GEPA uses the scores and feedback from the metric to automatically evolve better prompts for this RAG system over time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def gepa_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """
    GEPA metric function with feedback.

    GEPA is special because it can use textual feedback to guide evolution.
    This function returns both a score AND feedback about what went wrong.
    """
    expected = gold.expected_answer.lower()
    actual = pred.answer.lower() if hasattr(pred, 'answer') else ""

    # Check if the key information is in the answer
    if expected in actual:
        return 1.0  # Perfect match

    # Partial credit for relevant answers
    expected_words = set(expected.split())
    actual_words = set(actual.split())
    overlap = len(expected_words &amp;amp; actual_words) / len(expected_words) if expected_words else 0

    if overlap &amp;gt; 0.5:
        score = 0.7
        feedback = f"Partially correct. Expected '{gold.expected_answer}' but got related content."
    elif overlap &amp;gt; 0:
        score = 0.3
        feedback = f"Contains some relevant info but missing key details. Expected: '{gold.expected_answer}'"
    else:
        score = 0.0
        feedback = f"Incorrect. Expected answer to contain '{gold.expected_answer}' but got: '{actual[:100]}...'"

    # Return score with feedback for GEPA's reflection
    from dspy.teleprompt.gepa.gepa_utils import ScoreWithFeedback
    return ScoreWithFeedback(score=score, feedback=feedback)


class SimpleRAGForOptimization(dspy.Module):
    """A simple RAG module that GEPA will optimize."""

    def __init__(self, retriever):
        super().__init__()
        self.retriever = retriever
        self.generate = dspy.Predict("context, question -&amp;gt; answer")

    def forward(self, question: str):
        docs = self.retriever(question)
        context = "\n\n".join(docs)
        return self.generate(context=context, question=question)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  My Thoughts
&lt;/h1&gt;

&lt;p&gt;GEPA may not be a technique that can do “magical new things,” but it is one that can change “tasks that you were previously unsure about entrusting to AI” into “tasks that you can entrust with confidence.”&lt;/p&gt;

&lt;p&gt;While future challenges remain, such as multimodal support, real-time optimisation, and safety assurance, these also represent significant development opportunities.&lt;/p&gt;

&lt;p&gt;Beyond 2025, GEPA is expected to lead to innovative applications such as self-correcting AI systems, neural-symbolic integration, and meta-prompt engineering. GEPA will undoubtedly continue to play a central role in the future of prompt technology.&lt;/p&gt;

&lt;p&gt;I would highly appreciate your support through any of the following:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Support the Content (every Dollar goes back into the video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>DeepSeek-V3.2 + DocLing + Agentic RAG: Parse Any Document with Ease</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Mon, 15 Dec 2025 06:30:47 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/deepseek-v32-docling-agentic-rag-parse-any-document-with-ease-8bp</link>
      <guid>https://dev.to/gaodalie_ai/deepseek-v32-docling-agentic-rag-parse-any-document-with-ease-8bp</guid>
      <description>&lt;p&gt;If you’ve been following open-source logical modelling, you know it's become a highly competitive field. Every few months, a new model comes out and says it breaks old limits, and some of them truly do&lt;/p&gt;

&lt;p&gt;Just two days ago, after I quietly finished my exam and locked in, I was scrolling online late at night. DeepSeek, as always, sent a shockwave through the AI community.&lt;/p&gt;

&lt;p&gt;DeepSeek launched its latest model built for agents, “DeepSeek-V3.2,” along with its high-performance version, Speciale.&lt;/p&gt;

&lt;p&gt;These models have significantly improved reasoning capabilities, combining technological innovations such as efficient sparse attention and large-scale reinforcement learning.&lt;/p&gt;

&lt;p&gt;DeepSeek-V3.2 can go head-to-head with GPT-5, while Speciale, combining long-term thinking and theorem-proving capabilities, performs comparably to Gemini-3.0-Pro. One reader commented, “This model shouldn’t be called V3.2; it should be called V4.”&lt;/p&gt;

&lt;p&gt;In particular, the Speciale version achieved gold medal-level results at the 2025 IMO, IOI, and ICPC World Championships, placing in the top two at the ICPC World Championships and in the top ten at the IOI, achieving “gold-medal performance.”&lt;/p&gt;

&lt;p&gt;As part of my research and development, I needed to extract text data from PDFs as accurately as possible. In the past, I have extracted text from PDFs using PyMuPDF or the OCR engine Tesseract.&lt;/p&gt;

&lt;p&gt;These are powerful tools that have been used in many projects for many years. However, I ran into problems with them, possibly due to the PDF I was working with.&lt;/p&gt;

&lt;p&gt;Docling, an open source library developed by IBM Research, is an effective solution to these challenges. Docling is a powerful tool that can structure and convert documents, such as PDFs and Word files, into Markdown.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=AqPfp4vjbhk" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ll upload an Ocean AI PDF and ask the chatbot a question: “What is Ocean AI, and why is Ocean AI different from OpenAI?”&lt;/p&gt;

&lt;p&gt;If you look at how the chatbot generates the output, you’ll see that the agent first runs a relevance check to determine whether the question is actually related to your uploaded documents. If it’s not relevant, the agent immediately rejects the question instead of generating a hallucinated answer.&lt;/p&gt;

&lt;p&gt;For relevant questions, the agent parses the documents into structured formats such as Markdown or JSON. It then performs hybrid retrieval using both BM25 keyword search and vector embeddings to find the most relevant sections, even across multiple documents.&lt;/p&gt;

&lt;p&gt;The Research Agent uses this retrieved content to generate an answer, and then the Verification Agent cross-checks the response against the original documents to confirm factual accuracy and catch unsupported claims or contradictions.&lt;/p&gt;

&lt;p&gt;If verification fails, a self-correction loop automatically re-runs retrieval and research with adjusted parameters until the answer passes all checks. Once the answer is fully verified, the agent returns it. If at any point the question is found to be unrelated to the uploaded content, the agent clearly tells you instead of hallucinating.&lt;/p&gt;
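&lt;p&gt;The control flow of that verify-and-retry loop can be sketched with the agent calls stubbed out (the function and parameter names here are mine, not the project's):&lt;/p&gt;

```python
def answer_with_verification(question, retrieve, research, verify, max_attempts=3):
    """Re-run retrieval and research with adjusted parameters until the answer verifies."""
    k = 3  # number of chunks to retrieve, widened on each failed attempt
    for attempt in range(max_attempts):
        docs = retrieve(question, k=k)
        draft = research(question, docs)
        if verify(draft, docs):
            return draft  # answer passed all checks
        k += 2  # adjust retrieval parameters before retrying
    return "I cannot answer this question based on the provided documents."
```

Each failed verification widens retrieval; the fallback string is returned only when every attempt fails.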

&lt;h1&gt;
  
  
  What makes DeepSeek-V3.2 Unique?
&lt;/h1&gt;

&lt;p&gt;Most powerful AI models face a common problem: as file length increases, model execution speed decreases significantly, and costs rise dramatically. This is because traditional models attempt to compare each word with all other words to understand the context.&lt;/p&gt;

&lt;p&gt;DeepSeek-V3.2 addresses this problem by introducing a new method called DeepSeek Sparse Attention (DSA). You can think of it as a researcher conducting research in a library:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traditional method (dense attention): Researchers read every book on the shelf, page by page, just to answer one question. While comprehensive, this method is extremely slow and requires immense effort.&lt;/li&gt;
&lt;li&gt;The new method (DeepSeek-V3.2): Researchers use a digital index (Lightning Indexer) to find key pages and read only those pages quickly. This method is just as accurate, but much faster.&lt;/li&gt;
&lt;/ul&gt;
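&lt;p&gt;To make the analogy concrete, here is a toy illustration of the idea (not DeepSeek's actual DSA): score every key cheaply, keep only the top-k, and run softmax attention over that small subset instead of all keys.&lt;/p&gt;

```python
import math

def sparse_attention_weights(query, keys, top_k=2):
    """Toy sparse attention: attend only over the top_k highest-scoring keys."""
    # Cheap relevance score for every key (the "index" step)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax restricted to the selected keys (the "read only those pages" step)
    exps = {i: math.exp(scores[i]) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

weights = sparse_attention_weights([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.0, 1.0]])
# Only two keys receive attention; their weights still sum to 1
assert len(weights) == 2 and math.isclose(sum(weights.values()), 1.0)
```

In the real model the indexing step is itself learned; this sketch only shows why restricting attention to a few keys keeps the result a valid probability distribution.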

&lt;h1&gt;
  
  
  What makes Docling Unique?
&lt;/h1&gt;

&lt;p&gt;The biggest reason why Docling stands out from existing tools is that its design concept is based on collaboration with generative AI, particularly RAG (Retrieval Augmented Generation).&lt;/p&gt;

&lt;p&gt;Modern AI applications require more than just extracting text. For AI to deeply understand the content of a document and generate accurate answers, it needs to know its meaning, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is this sentence the “abstract” or the “conclusion” of the paper?&lt;/li&gt;
&lt;li&gt;This string of numbers is not just text but a “table,” so what does each cell mean?&lt;/li&gt;
&lt;li&gt;What “caption” accompanies this image?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While PyMuPDF and Tesseract extract text as “strings,” Docling uses the power of a vision-language model (VLM) to analyse these structures and relationships and output them as a “DoclingDocument” object with rich information.&lt;/p&gt;

&lt;p&gt;This structured data is the key to dramatically improving RAG’s retrieval and answer generation quality.&lt;/p&gt;

&lt;p&gt;Let’s start coding:&lt;/p&gt;

&lt;p&gt;Let us now explore step by step how to use DeepSeek-V3.2, Docling, and agentic RAG together. We will install the libraries that support the model from the requirements file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requirements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.&lt;/p&gt;

&lt;p&gt;DocumentConverter: A high-level Python class designed for converting documents into a structured DoclingDocument format.&lt;/p&gt;

&lt;p&gt;EnsembleRetriever: Ensemble retriever that aggregates and orders the results of multiple retrievers by using weighted Reciprocal Rank Fusion.&lt;/p&gt;
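&lt;p&gt;Weighted Reciprocal Rank Fusion itself is simple enough to sketch in a few lines (a standalone illustration of the scoring EnsembleRetriever is based on; the names are mine):&lt;/p&gt;

```python
def weighted_rrf(rankings, weights, c=60):
    """Weighted Reciprocal Rank Fusion: merge several ranked lists into one.

    Each document scores sum(weight / (c + rank)) over every retriever
    that returned it; c=60 is the conventional smoothing constant.
    """
    scores = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked high by both the BM25 list and the vector list wins overall
fused = weighted_rrf([["a", "b", "c"], ["b", "c", "a"]], [0.5, 0.5])
assert fused[0] == "b"
```

LangChain’s EnsembleRetriever applies this kind of weighted fusion internally; the sketch only illustrates the scoring.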

&lt;p&gt;&lt;strong&gt;DocumentProcessor:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a DocumentProcessor class that turns uploaded files into deduplicated, cacheable chunks. In __init__, I define the Markdown headers used for splitting and create the cache directory, and validate_files() rejects uploads whose combined size exceeds the configured limit.&lt;/p&gt;

&lt;p&gt;In process(), I hash each file's raw bytes so that previously seen files are loaded from a pickle cache instead of being re-parsed. Otherwise, Docling's DocumentConverter converts the file to Markdown, a MarkdownHeaderTextSplitter splits it by headers, and the resulting chunks are saved to the cache.&lt;/p&gt;

&lt;p&gt;Finally, I hash every chunk's text to drop duplicates across files, skip unsupported file types, log any per-file failures without aborting the whole batch, and return the list of unique chunks. Cached entries expire after a configurable number of days.&lt;br&gt;
&lt;/p&gt;
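&lt;p&gt;The content-hash caching in process() boils down to this pattern (a minimal standalone sketch; the helper name is mine):&lt;/p&gt;

```python
import hashlib
import pickle
from pathlib import Path

def cached(content: bytes, cache_dir: Path, compute):
    """Content-addressed cache: identical bytes always hit the same cache file."""
    key = hashlib.sha256(content).hexdigest()
    cache_path = cache_dir / f"{key}.pkl"
    if cache_path.exists():
        return pickle.loads(cache_path.read_bytes())
    result = compute()  # expensive step, e.g. parsing and chunking a document
    cache_path.write_bytes(pickle.dumps(result))
    return result
```

Because the key is derived from the file's bytes rather than its name, renaming or re-uploading the same document still hits the cache.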

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import hashlib
import pickle
from datetime import datetime, timedelta
from pathlib import Path
from typing import List
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter
from config import constants
from config.settings import settings
from utils.logging import logger

class DocumentProcessor:
    def __init__(self):
        self.headers = [("#", "Header 1"), ("##", "Header 2")]
        self.cache_dir = Path(settings.CACHE_DIR)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def validate_files(self, files: List) -&amp;gt; None:
        """Validate the total size of the uploaded files."""
        total_size = sum(os.path.getsize(f.name) for f in files)
        if total_size &amp;gt; constants.MAX_TOTAL_SIZE:
            raise ValueError(f"Total size exceeds {constants.MAX_TOTAL_SIZE//1024//1024}MB limit")

    def process(self, files: List) -&amp;gt; List:
        """Process files with caching for subsequent queries"""
        self.validate_files(files)
        all_chunks = []
        seen_hashes = set()

        for file in files:
            try:
                # Generate content-based hash for caching
                with open(file.name, "rb") as f:
                    file_hash = self._generate_hash(f.read())

                cache_path = self.cache_dir / f"{file_hash}.pkl"

                if self._is_cache_valid(cache_path):
                    logger.info(f"Loading from cache: {file.name}")
                    chunks = self._load_from_cache(cache_path)
                else:
                    logger.info(f"Processing and caching: {file.name}")
                    chunks = self._process_file(file)
                    self._save_to_cache(chunks, cache_path)

                # Deduplicate chunks across files
                for chunk in chunks:
                    chunk_hash = self._generate_hash(chunk.page_content.encode())
                    if chunk_hash not in seen_hashes:
                        all_chunks.append(chunk)
                        seen_hashes.add(chunk_hash)

            except Exception as e:
                logger.error(f"Failed to process {file.name}: {str(e)}")
                continue

        logger.info(f"Total unique chunks: {len(all_chunks)}")
        return all_chunks

    def _process_file(self, file) -&amp;gt; List:
        """Original processing logic with Docling"""
        if not file.name.endswith(('.pdf', '.docx', '.txt', '.md')):
            logger.warning(f"Skipping unsupported file type: {file.name}")
            return []

        converter = DocumentConverter()
        markdown = converter.convert(file.name).document.export_to_markdown()
        splitter = MarkdownHeaderTextSplitter(self.headers)
        return splitter.split_text(markdown)

    def _generate_hash(self, content: bytes) -&amp;gt; str:
        return hashlib.sha256(content).hexdigest()

    def _save_to_cache(self, chunks: List, cache_path: Path):
        with open(cache_path, "wb") as f:
            pickle.dump({
                "timestamp": datetime.now().timestamp(),
                "chunks": chunks
            }, f)

    def _load_from_cache(self, cache_path: Path) -&amp;gt; List:
        with open(cache_path, "rb") as f:
            data = pickle.load(f)
        return data["chunks"]

    def _is_cache_valid(self, cache_path: Path) -&amp;gt; bool:
        if not cache_path.exists():
            return False

        cache_age = datetime.now() - datetime.fromtimestamp(cache_path.stat().st_mtime)
        return cache_age &amp;lt; timedelta(days=settings.CACHE_EXPIRE_DAYS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RelevanceChecker&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a RelevanceChecker class that determines whether retrieved documents can answer a user's question by classifying them into three categories.&lt;/p&gt;

&lt;p&gt;In __init__, I initialize a deepseek-v3.2 model with the API key and create a prompt template that instructs the LLM to classify passages as "CAN_ANSWER" (fully answers), "PARTIAL" (mentions topic but incomplete), or "NO_MATCH" (doesn't discuss topic at all), with emphasis that any topic mention should be "PARTIAL", not "NO_MATCH". I built a LangChain chain by piping prompt → LLM → string parser.&lt;/p&gt;

&lt;p&gt;In the check() method, I take a question, a retriever object, and a k parameter (default 3) for how many top documents to analyse. I invoke the retriever with the question to get relevant chunks, returning "NO_MATCH" immediately if nothing comes back.&lt;/p&gt;

&lt;p&gt;I print debug info showing document count and 200-character previews of the top k chunks for visibility. I combine the top k document texts into one string with double newlines, invoke the LLM chain with the question and combined content, and get back a classification string.&lt;/p&gt;

&lt;p&gt;I validate the response is one of the three valid labels by converting to uppercase and checking against valid options, forcing "NO_MATCH" if the LLM returns something unexpected.&lt;br&gt;
Finally, I return the validated classification, giving me a clear signal about whether my retriever found usable documents or if I need to fall back to alternative methods like web search.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# agents/relevance_checker.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_deepseek import ChatDeepSeek
from config.settings import settings

class RelevanceChecker:
    def __init__(self):
        # self.llm = ChatOpenAI(api_key=settings.OPENAI_API_KEY, model="gpt-4o")
        self.llm = ChatDeepSeek(api_key=settings.DEEPSEEK_API_KEY, model="deepseek-chat")

        self.prompt = ChatPromptTemplate.from_template(
            """
            You are given a user question and some passages from uploaded documents.

            Classify how well these passages address the user's question. 
            Choose exactly one of the following responses (respond ONLY with that label):

            1) "CAN_ANSWER": The passages contain enough explicit info to fully answer the question.
            2) "PARTIAL": The passages mention or discuss the question's topic (e.g., relevant years, facility names)
            but do not provide all the data or details needed for a complete answer.
            3) "NO_MATCH": The passages do not discuss or mention the question's topic at all.

            Important: If the passages mention or reference the topic or timeframe of the question in ANY way,
            even if incomplete, you should respond "PARTIAL", not "NO_MATCH".

            Question: {question}
            Passages: {document_content}

            Respond ONLY with "CAN_ANSWER", "PARTIAL", or "NO_MATCH".
            """
        )

        self.chain = self.prompt | self.llm | StrOutputParser()

    def check(self, question: str, retriever, k=3) -&amp;gt; str:
        """
        1. Retrieve the top-k document chunks from the global retriever.
        2. Combine them into a single text string.
        3. Pass that text + question to the LLM chain for classification.

        Returns: "CAN_ANSWER" or "PARTIAL" or "NO_MATCH".
        """

        print(f"[DEBUG] RelevanceChecker.check called with question='{question}' and k={k}")

        # Retrieve doc chunks from the retriever
        top_docs = retriever.invoke(question)[:k]  # Only use top k docs
        if not top_docs:
            print("[DEBUG] No documents returned from retriever.invoke(). Classifying as NO_MATCH.")
            return "NO_MATCH"

        print(f"[DEBUG] Retriever returned {len(top_docs)} docs.")

        # Show a quick snippet of each chunk for debugging
        for i, doc in enumerate(top_docs):
            snippet = doc.page_content[:200].replace("\n", "\\n")
            print(f"[DEBUG] Chunk #{i+1} preview (first 200 chars): {snippet}...")

        # Combine the top k chunk texts into one string
        document_content = "\n\n".join(doc.page_content for doc in top_docs)
        print(f"[DEBUG] Combined text length for top {k} chunks: {len(document_content)} chars.")

        # Call the LLM
        response = self.chain.invoke({
            "question": question, 
            "document_content": document_content
        }).strip()

        print(f"[DEBUG] LLM raw classification response: '{response}'")

        # Convert to uppercase, check if it's one of our valid labels
        classification = response.upper()
        valid_labels = {"CAN_ANSWER", "PARTIAL", "NO_MATCH"}
        if classification not in valid_labels:
            print("[DEBUG] LLM did not respond with a valid label. Forcing 'NO_MATCH'.")
            classification = "NO_MATCH"
        else:
            print(f"[DEBUG] Classification recognized as '{classification}'.")

        return classification
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  ResearchAgent
&lt;/h1&gt;

&lt;p&gt;I created a ResearchAgent class that generates answers to questions using retrieved documents as context.&lt;/p&gt;

&lt;p&gt;I create a prompt template that asks the LLM to answer questions based on provided context, being precise and factual, with an instruction to explicitly say "I cannot answer this question based on the provided documents" if the context is insufficient.&lt;/p&gt;

&lt;p&gt;In the generate() method, I take a question string and a list of Document objects, then extract and concatenate all document text into one context string using double newlines as separators.&lt;/p&gt;

&lt;p&gt;I invoke the chain with the question and context, which substitutes them into the template, sends the request to DeepSeek, and returns the generated answer as a string. I wrap this in try-except to log both the answer and full context for debugging, and re-raise any exceptions that occur.&lt;/p&gt;

&lt;p&gt;Finally, I return a dictionary containing the draft answer and the context used, giving me both the generated response and traceability of what source material was used to create it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class ResearchAgent:
    def __init__(self):
        """Initialize the research agent with the OpenAI model."""
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0.3,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0.3,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Answer the following question based on the provided context. Be precise and factual.

            Question: {question}

            Context:
            {context}

            If the context is insufficient, respond with: "I cannot answer this question based on the provided documents."
            """
        )

    def generate(self, question: str, documents: List[Document]) -&amp;gt; Dict:
        """Generate an initial answer using the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            answer = chain.invoke({
                "question": question,
                "context": context
            })
            logger.info(f"Generated answer: {answer}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error generating answer: {e}")
            raise

        return {
            "draft_answer": answer,
            "context_used": context
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Verification Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I created a VerificationAgent class that fact-checks AI-generated answers against source documents to catch hallucinations. In __init__, I initialise a deepseek-v3.2 model with temperature 0 (fully deterministic), create a prompt template that instructs the LLM to verify four aspects (direct factual support, unsupported claims, contradictions, and relevance) with a structured response format, then build a LangChain chain.&lt;/p&gt;

&lt;p&gt;In check(), I take an answer string and a list of Document objects, concatenate all document text into one context string with double newlines, invoke the chain with the answer and context to get a verification report, log both the report and the context for debugging in a try-except block, and return a dictionary with the verification report and the context used for traceability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class VerificationAgent:
    def __init__(self):
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Verify the following answer against the provided context. Check for:
            1. Direct factual support (YES/NO)
            2. Unsupported claims (list)
            3. Contradictions (list)
            4. Relevance to the question (YES/NO)

            Respond in this format:
            Supported: YES/NO
            Unsupported Claims: [items]
            Contradictions: [items]
            Relevant: YES/NO

            Answer: {answer}
            Context: {context}
            """
        )

    def check(self, answer: str, documents: List[Document]) -&amp;gt; Dict:
        """Verify the answer against the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            verification = chain.invoke({
                "answer": answer,
                "context": context
            })
            logger.info(f"Verification report: {verification}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error verifying answer: {e}")
            raise

        return {
            "verification_report": verification,
            "context_used": context
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
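&lt;p&gt;The verification report comes back as plain text in the structured format the prompt requests. As a rough illustration (not part of the original code, and assuming the model follows the format exactly), a minimal parser might look like this:&lt;/p&gt;

```python
def parse_verification_report(report: str) -> dict:
    """Parse the 'Supported: YES ...' plain-text report into a dictionary."""
    result = {}
    for line in report.splitlines():
        if ":" not in line:
            continue  # skip lines that don't follow the key: value format
        key, _, value = line.partition(":")
        result[key.strip().lower().replace(" ", "_")] = value.strip()
    return result

report = """Supported: YES
Unsupported Claims: []
Contradictions: []
Relevant: YES"""

parsed = parse_verification_report(report)
print(parsed["supported"])  # YES
print(parsed["relevant"])   # YES
```

In a real pipeline you would add a fallback path for reports that drift from the expected format, since LLM output is not guaranteed to be perfectly structured.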



&lt;h1&gt;
  
  
Conclusion
&lt;/h1&gt;

&lt;p&gt;DeepSeek V3.2 doesn't win by scale, but by smarter thinking. With its sparse attention mechanism, lower cost, stronger long-context awareness, and superior tool-use inference capabilities, it demonstrates how open-source models can remain competitive without massive hardware budgets.&lt;/p&gt;

&lt;p&gt;While it may not top every benchmark, it significantly improves how users interact with AI today. And that's precisely why it stands out in a highly competitive market.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Support the Content (every Dollar goes back into the video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>RAG Will Never Be the Same After Gemini File Search Tool</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Tue, 18 Nov 2025 22:44:50 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/rag-will-never-be-the-same-after-gemini-file-search-tool-2je5</link>
      <guid>https://dev.to/gaodalie_ai/rag-will-never-be-the-same-after-gemini-file-search-tool-2je5</guid>
      <description>&lt;p&gt;Last week I heard bad news, and life hit me hard again. Moments like that remind me how fragile everything is — how one day we all leave, and even love can feel temporary.&lt;/p&gt;

&lt;p&gt;In the middle of all this, I saw a post on X saying Gemini’s File Search Tool makes RAG super easy and is being offered at a really reasonable cost. I don’t know why, but something about it pushed me to try it.&lt;/p&gt;

&lt;p&gt;Google announced the File Search Tool, a fully managed retrieval-augmented generation (RAG) system built directly into the Gemini API.&lt;/p&gt;

&lt;p&gt;Previously, to build a RAG, you had to choose a vector database, develop a chunking strategy, call an embedding model, and tie everything together. The file search tool handles all of that automatically behind the API.&lt;/p&gt;

&lt;p&gt;These were major barriers for companies wanting to introduce AI, but with the introduction of the File Search Tool, these mechanisms can now be completed within the Gemini API.&lt;/p&gt;

&lt;p&gt;Developers can simply upload files and use standard API calls to generate answers based on their own data, with the response clearly indicating which part of which file the AI agent referenced when generating an answer. This helps prevent hallucination, a common problem with generative AI.&lt;/p&gt;

&lt;p&gt;The File Search Tool helps developers build file search and ingestion pipelines in a simple, integrated, and flexible way to enhance Gemini answers with their own data. File storage and query-time embedding generation are free, with a one-time fee only for the initial indexing of files.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=WORicSBIU0I" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During my development, one paper drew my attention. AI is increasingly involved across industries, influencing science and beyond. I will upload the Ocean AI PDF.&lt;/p&gt;

&lt;p&gt;I will ask the chatbot a question: “What is Ocean AI, and why is Ocean AI different from OpenAI?” If you take a look at how the chatbot generates the output, you’ll see that the agent first saves my uploaded PDF to a temporary file, then creates a unique FileSearchStore with a random ID.&lt;/p&gt;

&lt;p&gt;The agent uploads the PDF into this store and waits while Gemini breaks the document into chunks and builds a searchable index — a wait_operation function polls every 2 seconds until indexing finishes.&lt;/p&gt;

&lt;p&gt;When I type my question and hit enter, query_file_search sends it to the Gemini API along with the store name. Gemini automatically searches through the indexed PDF chunks, finds the relevant sections about Ocean AI and how it differs from OpenAI, uses those chunks as context, and generates an answer using the selected model.&lt;/p&gt;

&lt;p&gt;The response includes the answer text plus grounding metadata showing exactly which parts of the PDF were used, so when I click "View Sources", I can see the citations proving where the information came from. When I'm done, clicking "Clear PDF" deletes the entire store and cleans up all the data.&lt;/p&gt;

&lt;h1&gt;
  
  
  What makes the File Search Tool different?
&lt;/h1&gt;

&lt;p&gt;The Gemini API File Search Tool consolidates these complex processes into a single, fully automated API call (generateContent), allowing developers to leverage file search functionality within their existing APIs and eliminating the complex setup and management work previously required.&lt;/p&gt;

&lt;p&gt;Unlike traditional keyword-based searches, the File Search Tool understands the meaning and context of your query and can find relevant information even if exact word matches are not used.&lt;/p&gt;

&lt;p&gt;This is achieved through powerful vector search, leveraging the latest Gemini Embedding model.&lt;/p&gt;
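&lt;p&gt;To give a sense of how vector search differs from keyword matching, here is a toy sketch with made-up 3-dimensional embedding values (real embeddings come from the Gemini Embedding model and have hundreds of dimensions):&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity: closer to 1.0 means more semantically similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: phrases with no shared keywords can still be close.
emb = {
    "refund policy": [0.9, 0.1, 0.2],
    "money-back guarantee": [0.85, 0.15, 0.25],
    "shipping times": [0.1, 0.9, 0.3],
}
query = emb["refund policy"]
candidates = ["money-back guarantee", "shipping times"]
best = max(candidates, key=lambda k: cosine(query, emb[k]))
print(best)  # money-back guarantee
```

Even though “refund policy” and “money-back guarantee” share no words, their vectors point in nearly the same direction, which is exactly what lets semantic search find them together.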

&lt;p&gt;Even more noteworthy is the implementation of auto-citation, which automatically includes citations to the specific documents used to generate the answer, greatly simplifying verification and fact-checking and making it much more useful for businesses.&lt;/p&gt;

&lt;h1&gt;
  
  
  Current limitations and expected improvements
&lt;/h1&gt;

&lt;p&gt;The File Search Tool currently has some limitations. The most significant is the limited ability to adjust the number of chunks retrieved. During testing, we confirmed that advanced configuration options such as metadata filters are available, but we hope future enhancements will allow finer control over the number of chunks.&lt;/p&gt;

&lt;p&gt;There is also room for improvement in the accuracy of image recognition. Currently, it is possible to extract text from images, but it is not yet at the level of understanding the structure and relationships of diagrams. In particular, it can be difficult to extract meaningful information from documents written in Markdown format or with complex layouts.&lt;/p&gt;

&lt;p&gt;File size limitations are also a consideration. Each file is limited to a maximum of 100MB, and the file search store size for the entire project is limited to 1GB-1TB, depending on the user tier. These limitations may affect practicality for large enterprises.&lt;/p&gt;
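&lt;p&gt;Given the 100MB per-file cap, a simple client-side guard before uploading lets the app fail fast with a clear message (an illustrative helper of my own, not part of the official SDK):&lt;/p&gt;

```python
import os

MAX_FILE_BYTES = 100 * 1024 * 1024  # the 100 MB per-file limit noted above

def check_upload_size(path: str) -> int:
    """Raise early if a file exceeds the per-file limit; return its size in bytes."""
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        raise ValueError(
            f"{path} is {size} bytes, which exceeds the 100 MB per-file limit"
        )
    return size
```

Calling this before save_uploaded_file avoids paying for an upload that the API would reject anyway.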

&lt;h1&gt;
  
  
  Differences from OpenAI/Anthropic
&lt;/h1&gt;

&lt;p&gt;Currently, OpenAI’s Retrieval API and Anthropic’s File Contexts are well-known examples of RAG implementations. These systems use external storage to reference documents, but they require developers to build and manage a vector database, making them difficult to implement.&lt;/p&gt;

&lt;p&gt;On the other hand, the File Search Tool completely automates this part and is done entirely within the Gemini API. The table below compares the three major RAG solutions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiybwbkek8hn1scj09o5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiybwbkek8hn1scj09o5.webp" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen from this comparison, the File Search Tool is superior in terms of both development burden and operational costs, and is particularly suitable for prototype development and experimental use by individual developers.&lt;/p&gt;

&lt;p&gt;In addition, the Gemini Embedding model provided by Google provides high search accuracy and is also a major attraction in that it can accurately extract information with similar meanings.&lt;/p&gt;

&lt;p&gt;Let’s start coding:&lt;/p&gt;

&lt;p&gt;Before we dive into our application, we will create an ideal environment for the code to work. For this, we need to install the necessary Python libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: we will import the relevant libraries, whose significance will become evident as we proceed, and perform some basic configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import os
import time
import random
import string
import tempfile
from pathlib import Path
from PyPDF2 import PdfReader
from google import genai
from google.genai import types
from dotenv import load_dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I designed these helper functions to handle the main tasks of the app. First, I created get_text(key, lang='en') to get translated text - it just looks up a word or phrase in a translation dictionary, defaults to English if the language doesn't exist, and returns the original key if nothing is found.&lt;/p&gt;

&lt;p&gt;Then I built generate_random_id(length=8) to make random IDs for naming stores - it randomly picks 8 characters from letters and numbers and combines them into a string.&lt;/p&gt;

&lt;p&gt;I developed wait_operation(client, op, sleep_sec=2, max_wait_sec=300) to wait for background operations to finish - it keeps checking every 2 seconds if the operation is done by calling the API, and if it takes longer than 5 minutes, it stops waiting and throws an error so the app doesn't hang forever.&lt;/p&gt;

&lt;p&gt;Next, I made extract_text_from_pdf(pdf_file, lang='en') to pull text out of PDF files - it opens the PDF, goes through each page one by one, grabs the text from each page, adds it all together with line breaks, and returns the complete text.&lt;/p&gt;

&lt;p&gt;I wrapped this in error handling, so if the PDF is broken or can't be read, it shows an error message to the user instead of crashing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_text(key, lang='en'):
    """Get translated text for the given key and language"""
    return TRANSLATIONS.get(lang, TRANSLATIONS['en']).get(key, key)

# Helper Functions
def generate_random_id(length=8):
    """Generate a random ID for store naming"""
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))

def wait_operation(client, op, sleep_sec=2, max_wait_sec=300):
    """Wait for Operations API to complete with timeout"""
    start = time.time()
    while not op.done:
        if time.time() - start &amp;gt; max_wait_sec:
            raise TimeoutError("Operation timed out.")
        time.sleep(sleep_sec)
        op = client.operations.get(op)
    return op

def extract_text_from_pdf(pdf_file, lang='en'):
    """Extract text content from uploaded PDF file"""
    try:
        pdf_reader = PdfReader(pdf_file)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() + "\n"
        return text
    except Exception as e:
        st.error(get_text('error_pdf_extract', lang).format(e))
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Building on these utilities, I created three more functions to handle file management and storage. The save_uploaded_file(uploaded_file, lang='en') function takes care of saving uploaded files temporarily - it creates a temporary file that won't auto-delete, adds a .pdf extension to it, writes the uploaded file's content into it using getvalue(), and returns the file path so we can use it later with the other functions.&lt;/p&gt;

&lt;p&gt;Next, create_file_search_store(client, store_name, lang='en') sets up a new storage space using the random ID from generate_random_id. It calls the API to create a file search store with a custom display name, returns the store object if successful, or shows an error message via get_text and returns None if it fails.&lt;/p&gt;

&lt;p&gt;The last function, upload_file_to_store(client, file_path, store_name, display_name, lang='en'), actually uploads files into the store - it sends the file to the specified store using the API, adds some metadata like the source being "streamlit_upload" and a timestamp of when it was uploaded, then waits for the upload to complete using my wait_operation function from earlier, and returns the response once it's done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def save_uploaded_file(uploaded_file, lang='en'):
    """Save uploaded file to temporary location"""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(get_text('error_save_file', lang).format(e))
        return None

def create_file_search_store(client, store_name, lang='en'):
    """Create a new File Search Store"""
    try:
        store = client.file_search_stores.create(
            config={'display_name': store_name}
        )
        return store
    except Exception as e:
        st.error(get_text('error_create_store', lang).format(e))
        return None

def upload_file_to_store(client, file_path, store_name, display_name, lang='en'):
    """Upload file to File Search Store"""
    try:
        upload_op = client.file_search_stores.upload_to_file_search_store(
            file=file_path,
            file_search_store_name=store_name,
            config={
                'display_name': display_name,
                'custom_metadata': [
                    {"key": "source", "string_value": "streamlit_upload"},
                    {"key": "timestamp", "numeric_value": int(time.time())}
                ]
            }
        )
        upload_op = wait_operation(client, upload_op)
        return upload_op.response
    except Exception as e:
        st.error(get_text('error_upload_store', lang).format(e))
        return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built two final functions that let users interact with the uploaded files and clean up afterwards. The query_file_search(client, question, store_name, model, lang='en') function is where the magic happens - it takes a user's question and searches through the files in the store by calling the AI model with special file search tools configured.&lt;/p&gt;

&lt;p&gt;It passes the question to the model along with a reference to the store name we created earlier, and the model automatically searches through all the uploaded files to find relevant information and generate an answer.&lt;/p&gt;

&lt;p&gt;Like the other functions, it uses get_text for error messages and returns None if something goes wrong. After the user is done working with their files, cleanup_store(client, store_name, lang='en') handles the cleanup: it deletes the entire file search store, including all uploaded files, by calling the delete API with the force: True flag to make sure everything gets removed, returns True if successful or False if it fails, and shows an error message using the translation helper if anything breaks during deletion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def query_file_search(client, question, store_name, model, lang='en'):
    """Query the File Search Store with a question"""
    try:
        response = client.models.generate_content(
            model=model,
            contents=question,
            config=types.GenerateContentConfig(
                tools=[
                    types.Tool(
                        file_search=types.FileSearch(
                            file_search_store_names=[store_name]
                        )
                    )
                ]
            )
        )
        return response
    except Exception as e:
        st.error(get_text('error_query', lang).format(e))
        return None

def cleanup_store(client, store_name, lang='en'):
    """Delete the File Search Store"""
    try:
        client.file_search_stores.delete(
            name=store_name,
            config={'force': True}
        )
        return True
    except Exception as e:
        st.error(get_text('error_cleanup', lang).format(e))
        return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
Conclusion
&lt;/h1&gt;

&lt;p&gt;The file search tool puts the advanced technology of RAG within the reach of all developers, not just a select few experts. This is truly the “democratisation of RAG.”&lt;/p&gt;

&lt;p&gt;Freed from worries about complex infrastructure and costs, developers can focus on developing more creative applications that directly address user challenges.&lt;/p&gt;

&lt;p&gt;Combining your unique data with Gemini’s powerful intelligence will create new business value that was previously impossible. Let’s use this new tool to create the applications of the future!&lt;/p&gt;

&lt;p&gt;Reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/file-search" rel="noopener noreferrer"&gt;https://ai.google.dev/gemini-api/docs/file-search&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.google/technology/developers/file-search-gemini-api/" rel="noopener noreferrer"&gt;https://blog.google/technology/developers/file-search-gemini-api/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I would highly appreciate it if you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Support the Content (every Dollar goes back into the video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>DeepSeek-OCR + LLama4 + RAG Just Revolutionized Agent OCR Forever</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Wed, 29 Oct 2025 07:48:52 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/deepseek-ocr-llama4-rag-just-revolutionized-agent-ocr-forever-5f8j</link>
      <guid>https://dev.to/gaodalie_ai/deepseek-ocr-llama4-rag-just-revolutionized-agent-ocr-forever-5f8j</guid>
      <description>&lt;p&gt;During the weekend, I scrolled through Twitter to see what was happening in the AI community. Once again, DeepSeek has drawn worldwide attention.&lt;/p&gt;

&lt;p&gt;This isn’t just any text recognition tool — it’s a brand-new contextual optical compression technology that uses visual methods to solve the challenge of processing long texts, offering a completely new approach to handling massive amounts of document information.&lt;/p&gt;

&lt;p&gt;Anyone who has used a large language model (LLM) has encountered a common pain point:&lt;/p&gt;

&lt;p&gt;When you ask the model to summarise tens of thousands of words from conference notes or academic papers, it starts to lose its memory.&lt;/p&gt;

&lt;p&gt;This is because the quadratic complexity of sequence length inherently limits GPT, Gemini, and Claude — the longer the input, the more computational power it requires.&lt;/p&gt;

&lt;p&gt;But humans aren’t like that.&lt;br&gt;
We can glance at a note or a diagram and instantly recall an entire passage.&lt;/p&gt;

&lt;p&gt;Traditionally, for AI to understand long documents, the entire document must be converted into digital text. This process consumes a large number of tokens (which can be understood as the units used by AI to process information), resulting in low computational efficiency.&lt;/p&gt;

&lt;p&gt;DeepSeek-OCR takes a different approach: it first converts text into images and then uses visual tokens to compress and represent this information. Imagine you have a 10,000-word article — instead of having AI read it word by word, it can simply “glance” at an image to understand and reconstruct the original text.&lt;/p&gt;

&lt;p&gt;The core breakthrough lies in its ability to represent rich information in a single image containing document text using far fewer tokens than the equivalent text. This means that optical compression with visual tokens can achieve higher compression ratios, allowing us to do more with fewer resources.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=NkMqcRmspFs&amp;amp;t=509s" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will ask the chatbot a question: “What are the main findings?” If you look at how the chatbot generates the output, you’ll see that the agent extracts text from each page; if a page contains fewer than 50 characters or lacks embedded text, it converts that page into a high-resolution image and sends it to DeepSeek-OCR on Replicate. DeepSeek-OCR uses an innovative “Contextual Optical Compression” approach: it converts the document into visual tokens and compresses the information, essentially allowing the AI to “glance” at an image representation rather than reading word by word, which can turn a 10,000-word article into a much more efficient compressed format.&lt;/p&gt;

&lt;p&gt;Once all text is extracted, the system breaks it into 500-character chunks with 50-character overlap to maintain context, converts each chunk into mathematical vectors using OpenAI embeddings, and stores them in a Chroma vector database that persists on disk for future use.&lt;/p&gt;

&lt;p&gt;When you ask a question, the agent searches through these vectors to find the 5 most semantically similar document chunks, assembles them into a context prompt along with your question and instructions to cite page numbers, then sends everything to the Llama 3.1 405B model running on Replicate’s streaming API, which processes the prompt and generates an intelligent answer chunk-by-chunk in real-time.&lt;/p&gt;

&lt;p&gt;The agent then generates the answer along with source document citations showing which pages the information came from, creating a complete RAG agent that can understand any PDF.&lt;/p&gt;
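&lt;p&gt;The 500-character / 50-character-overlap chunking described above can be sketched in plain Python (a simplified illustration; the app itself uses LangChain's RecursiveCharacterTextSplitter, which also tries to break on natural boundaries):&lt;/p&gt;

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks whose edges overlap to preserve context."""
    chunks = []
    step = chunk_size - overlap  # advance 450 chars, so 50 chars repeat per chunk
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk reached the end of the text
        start += step
    return chunks

doc = "x" * 1200
chunks = chunk_text(doc)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The 50-character overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps retrieval from losing context at the seams.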
&lt;h1&gt;
  
  
  What makes DeepSeek-OCR Unique?
&lt;/h1&gt;

&lt;p&gt;DeepSeek-OCR is an end-to-end OCR and document parsing model designed to achieve optical context compression.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feuilw4ftckzg4r8jnwg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feuilw4ftckzg4r8jnwg5.png" alt=" " width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model consists of two major components: a DeepEncoder that compresses high-resolution image input into a small number of visual tokens, and a DeepSeek-3B-MoE decoder (a Mixture-of-Experts language model) that restores the original text from the visual token sequence.&lt;/p&gt;

&lt;p&gt;DeepEncoder (approximately 380 million parameters) incorporates a SAM-based window attention mechanism for local image feature extraction, and by inserting a two-layer CNN with 16x compression in between, it significantly compresses a 1024x1024 pixel image from 4096 patches to around 256 tokens.&lt;/p&gt;

&lt;p&gt;The decoder side, which receives these visual tokens, has a total of 3 billion parameters (approximately 570 million active during inference) and features a MoE structure that dynamically selects 6 experts per step from a pool of 64, allowing for lightweight yet efficient text reconstruction.&lt;/p&gt;

&lt;p&gt;With this architecture, DeepSeek-OCR takes an unconventional approach by converting the contents of a text document into an “image” and then reading it.&lt;/p&gt;
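&lt;p&gt;The encoder's token budget works out as simple arithmetic (the 16-pixel patch size is my assumption to make the 4096-patch figure come out; the 16x compression factor is from the text above):&lt;/p&gt;

```python
image_side = 1024                  # input resolution in pixels
patch_side = 16                    # assumed ViT-style patch size: 1024 / 16 = 64 per side
patches = (image_side // patch_side) ** 2   # 64 * 64 = 4096 patch tokens
compression = 16                   # 16x compression from the two-layer CNN
visual_tokens = patches // compression      # 4096 / 16 = 256 visual tokens

print(patches, visual_tokens)  # 4096 256
```

So a full 1024x1024 page enters the decoder as only about 256 tokens, which is the whole point of optical compression.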
&lt;h1&gt;
  
  
  PaddleOCR-VL vs. DeepSeek-OCR
&lt;/h1&gt;

&lt;p&gt;Check the video PaddleOCR-VL: &lt;a href="https://www.youtube.com/watch?v=brq5rPkTfyw" rel="noopener noreferrer"&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I tested both OCR models, I found something interesting — PaddleOCR-VL, which has fewer parameters (0.9B), was beating much larger 3B models in real-world tests.&lt;/p&gt;

&lt;p&gt;I gave it tough jobs: reading vertical text in the right direction, understanding complex math formulas, and handling documents with multiple columns — and PaddleOCR-VL nailed them all, while DeepSeek-OCR made mistakes with reading order and formulas, even though it has cool compression features.&lt;/p&gt;

&lt;p&gt;Then I discovered something fun in DeepSeek-OCR’s research paper — they actually thanked PaddleOCR and admitted they used it to label their training data, which made me realize why companies like Baidu, DeepSeek, and Shanghai AI Lab are all releasing OCR models: they’re not making OCR tools as their main job, they’re building them to clean up huge amounts of data for training their AI models, and we’re getting these powerful OCR tools as free bonuses.&lt;/p&gt;

&lt;p&gt;After testing everything, I figured out that if you’re building something for real work and need to read printed text, forms, tables, or documents in different languages, PaddleOCR-VL is the way to go, while DeepSeek-OCR is better if you’re a researcher trying to compress data to save money on AI costs.&lt;/p&gt;
&lt;h1&gt;
  
  
  Text Tokens vs. Visual Tokens: The Fundamental Difference
&lt;/h1&gt;

&lt;p&gt;In traditional LLMs, text is broken down into discrete text tokens (typically words or subwords). Each token is assigned a fixed ID in the vocabulary and mapped to a vector via a large “lookup table” (the embedding layer). While this process is efficient, its expressive power is constrained by the fixed vocabulary.&lt;/p&gt;

&lt;p&gt;Visual Tokens are completely different. Instead of coming from a fixed lookup table, they are continuous vectors generated directly from image pixels by a neural network (visual encoder). This means:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Higher information density:&lt;/code&gt; Visual tokens exist in a continuous vector space and can encode richer and more nuanced information than discrete text tokens. A visual token can represent the color, shape, texture, and spatial relationships within an area, rather than just a word or subword.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Global pattern perception:&lt;/code&gt; The visual encoder can capture global information, such as the overall layout, typesetting, and font style of the text, which is lost in the plain text token sequence.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Larger expression space:&lt;/code&gt; In theory, the “vocabulary” of visual tokens is infinite because they are continuous vectors generated directly from pixels rather than selected from a fixed dictionary.&lt;/p&gt;
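&lt;p&gt;The contrast can be sketched in a few lines (a toy illustration with random weights, not the actual model): a text token is always one of finitely many rows of an embedding table, while a visual token is computed directly from pixel values, so it is not restricted to any fixed set of vectors.&lt;/p&gt;

```python
import random

random.seed(0)
DIM = 8  # toy embedding dimension

# Text tokens: a token ID selects one fixed row from a finite embedding table.
vocab = [[random.random() for _ in range(DIM)] for _ in range(1000)]

def text_token(token_id):
    return vocab[token_id]  # only 1000 possible vectors exist

# Visual tokens: a continuous vector computed from pixel values by an encoder
# (here a toy linear layer), so the "vocabulary" is effectively unbounded.
weights = [[random.random() for _ in range(DIM)] for _ in range(256)]

def visual_token(pixels):  # pixels: 256 values, e.g. a flattened 16x16 patch
    return [sum(p * w[d] for p, w in zip(pixels, weights)) for d in range(DIM)]

patch = [random.random() for _ in range(256)]
print(len(text_token(42)), len(visual_token(patch)))  # 8 8
```

Two different patches almost never produce the same visual token, whereas two occurrences of the same word always map to the identical text-token vector.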

&lt;p&gt;Let’s start coding:&lt;/p&gt;

&lt;p&gt;Before we dive into our application, we will create an ideal environment for the code to work. For this, we need to install the necessary Python libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: we will import the relevant libraries, whose significance will become evident as we proceed, and perform some basic configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import replicate
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_core.language_models.llms import LLM
from typing import List, Optional, Any
import fitz
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I developed this custom Llama class by inheriting from LangChain's base LLM class and configuring it with the Llama 3.1 405B model identifier, token limits, and temperature settings.&lt;/p&gt;

&lt;p&gt;I implemented the required _llm_type property to return an identifier, then I built the core _call method, which takes a prompt, packages it with the configuration into a dictionary, sends it to Replicate's streaming API, and loops through the response chunks to concatenate them into a complete answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Llama(LLM):
    model: str = "meta/meta-llama-3.1-405b-instruct"
    max_tokens: int = 1024
    temperature: float = 0.7

    @property
    def _llm_type(self) -&amp;gt; str:
        return "replicate_llama"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -&amp;gt; str:
        input_data = {
            "prompt": prompt,
            "max_tokens": self.max_tokens,
            "temperature": self.temperature
        }

        output = ""
        for event in replicate.stream(self.model, input=input_data):
            output += str(event)

        return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built this OCRPDFLoader class to extract text from PDFs by first trying text extraction and falling back to OCR when needed. I initialised it with a file path, an optional OCR flag, and a text threshold (default 50 characters) to detect if a page has enough text.&lt;/p&gt;

&lt;p&gt;In the load method, I opened the PDF with PyMuPDF, looped through each page to extract text, then checked if OCR was forced or if the extracted text was below the threshold. If so, I called my _ocr_page method, which I built to convert the page into a high-resolution PNG image, send it to Replicate's DeepSeek-OCR API, get the OCR text back, clean up the temporary image, and return the extracted text.&lt;/p&gt;

&lt;p&gt;Finally, I packaged each page's text into LangChain Document objects with metadata (source file, page number, filename) and returned them as a list, giving me a smart loader that automatically handles both digital and scanned PDFs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class OCRPDFLoader:
    def __init__(self, file_path: str, use_ocr: bool = False, text_threshold: int = 50):
        self.file_path = file_path
        self.use_ocr = use_ocr
        self.text_threshold = text_threshold

    def load(self) -&amp;gt; List[Document]:
        doc = fitz.open(self.file_path)
        documents = []

        for page_num in range(len(doc)):
            page = doc[page_num]
            text = page.get_text()

            if self.use_ocr or len(text.strip()) &amp;lt; self.text_threshold:
                print(f"OCR: page {page_num + 1}")
                text = self._ocr_page(page, page_num)

            if text.strip():
                documents.append(Document(
                    page_content=text.strip(),
                    metadata={
                        'source': self.file_path,
                        'page': page_num + 1,
                        'filename': Path(self.file_path).name
                    }
                ))

        doc.close()
        return documents

    def _ocr_page(self, page, page_num, temp_dir='./temp_ocr'):
        os.makedirs(temp_dir, exist_ok=True)

        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img_path = f"{temp_dir}/page_{page_num}.png"
        pix.save(img_path)

        with open(img_path, "rb") as image_file:
            input_data = {
                "image": image_file,
                "task_type": "Free OCR"
            }

            output = replicate.run(
                "lucataco/deepseek-ocr:cb3b474fbfc56b1664c8c7841550bccecbe7b74c30e45ce938ffca1180b4dff5",
                input=input_data
            )

        os.remove(img_path)
        return output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I built this LangChainPDFRAG class, the main orchestrator that ties everything together into a complete RAG system. I initialised it by setting up my custom Llama model for generating answers, OpenAI embeddings for converting text into vectors, a text splitter that breaks documents into 500-character chunks with 50-character overlap to maintain context between chunks, and a Chroma vector database that I configured to persist on disk so it could reload existing data between sessions.&lt;/p&gt;

&lt;p&gt;I created the add_pdf method, which uses my OCR loader to extract text from PDFs, splits that text into manageable chunks, then either creates a new vector store or adds to an existing one by converting each chunk into embeddings and storing them for semantic search.&lt;/p&gt;

&lt;p&gt;Finally, I implemented the query method where I set up a retriever to find the 5 most relevant document chunks, built a LangChain chain that takes a user's question, retrieves relevant context, formats it into a prompt template asking the LLM to cite page numbers, passes everything to my Llama model for generation, and returns both the generated answer and the source documents with their page numbers - essentially creating a complete question-answering system that can intelligently search through PDFs and provide accurate, cited responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class LangChainPDFRAG:
    def __init__(self, 
                 llm_model='meta/meta-llama-3.1-405b-instruct',
                 embedding_model='text-embedding-3-small',
                 persist_directory='./chroma_db'):

        self.llm = Llama(model=llm_model)
        self.embeddings = OpenAIEmbeddings(model=embedding_model)
        self.persist_directory = persist_directory
        self.vectorstore = None

        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

        if os.path.exists(persist_directory):
            self.vectorstore = Chroma(
                persist_directory=persist_directory,
                embedding_function=self.embeddings
            )

    def add_pdf(self, pdf_path: str, use_ocr: bool = False):
        loader = OCRPDFLoader(pdf_path, use_ocr=use_ocr)
        documents = loader.load()
        splits = self.text_splitter.split_documents(documents)

        if self.vectorstore is None:
            self.vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=self.embeddings,
                persist_directory=self.persist_directory
            )
        else:
            self.vectorstore.add_documents(splits)

        print(f"Added {len(splits)} chunks from {Path(pdf_path).name}")
        return len(splits)

    def query(self, question: str):
        if self.vectorstore is None:
            raise ValueError("No documents.")

        retriever = self.vectorstore.as_retriever(search_kwargs={"k": 5})

        def format_docs(docs):
            return "\n\n".join([doc.page_content for doc in docs])

        prompt = ChatPromptTemplate.from_template(
            "You are a helpful assistant. Answer based on the context provided. Cite page numbers when relevant.\n\n"
            "Context:\n{context}\n\n"
            "Question: {question}\n\n"
            "Answer:"
        )

        chain = (
            {"context": retriever | format_docs, "question": RunnablePassthrough()}
            | prompt
            | self.llm
            | StrOutputParser()
        )

        docs = retriever.invoke(question)
        answer = chain.invoke(question)

        return {
            'answer': answer,
            'sources': [
                {
                    'filename': doc.metadata.get('filename'),
                    'page': doc.metadata.get('page'),
                    'content': doc.page_content[:200]
                }
                for doc in docs
            ]
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I instantiated the RAG system with Llama 3.1 405B, loaded a PDF into the vector database, and queried it with a question. The agent retrieved relevant document chunks, generated an answer, and returned both the answer and source citations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    # Using Llama 3.1 405B from Replicate
    rag = LangChainPDFRAG(llm_model='meta/meta-llama-3.1-405b-instruct')

    rag.add_pdf('TSLA-Q2-2025-Update.pdf', use_ocr=False)

    result = rag.query('What are the main findings?')

    print("=== Answer ===")
    print(result['answer'])

    print("\n=== Sources ===")
    for source in result['sources']:
        print(f"- {source['filename']}, Page {source['page']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conclusion:&lt;/p&gt;

&lt;p&gt;DeepSeek-OCR is not just a more powerful OCR tool, but a research paper that opens a new chapter. The concept of visual-text compression that it proposes offers an imaginative path to solving one of the biggest challenges facing current large-scale models: the bottleneck of long context processing efficiency.&lt;/p&gt;

&lt;p&gt;By “rendering” textual information as two-dimensional images and compressing it into information-dense visual tokens using an efficient visual encoder, DeepSeek-OCR demonstrates that AI can “see images” like humans can, allowing it to understand and remember large amounts of information more efficiently.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you could support my work:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Support the Content (every Dollar goes back into the video): &lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>PaddleOCR VL + RAG: Revolutionize Complex Data Extraction (Open-Source)</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Fri, 24 Oct 2025 16:55:07 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/paddleocr-vl-rag-revolutionize-complex-data-extraction-open-source-lja</link>
      <guid>https://dev.to/gaodalie_ai/paddleocr-vl-rag-revolutionize-complex-data-extraction-open-source-lja</guid>
      <description>&lt;p&gt;Not even a month ago, I made a video about MistralOCR that many of you liked. &lt;/p&gt;

&lt;p&gt;After that, a follower reached out with a problem they were having with an OCR Chatbot. I figured this was a common issue, so I decided to make a new video to help them and other developers.&lt;/p&gt;

&lt;p&gt;When documents contain complex tables, mathematical formulas, or multi-column layouts, traditional OCR tools often generate messy content that requires manual sorting.&lt;/p&gt;

&lt;p&gt;Then, just last week, I was browsing GitHub and came across Baidu's newly open-sourced PaddleOCR-VL-0.9B. &lt;/p&gt;

&lt;p&gt;I'll be honest - when I saw it had only 0.9 billion parameters, my first thought was "Oh, another small model joining the fun?" But out of professional curiosity, I had to ask: could this one actually deliver? What I found completely stunned me.&lt;/p&gt;

&lt;p&gt;This isn't just OCR - it's a quantum leap in document understanding.&lt;br&gt;
PaddleOCR-VL completely exceeded my expectations. It took first place worldwide in comprehensive performance, scoring 92.6 on the authoritative OmniDocBench v1.5 leaderboard. Its inference speed is 14.2% higher than MinerU2.5's and 253.01% higher than dots.ocr's.&lt;/p&gt;

&lt;p&gt;The most intuitive feeling I had was that it was very accurate, or too accurate! It is worthy of being the model that can reach the top and be ranked first.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=brq5rPkTfyw" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, I'll be putting PaddleOCR-VL to the test on four key challenges: Formula Recognition, Table Recognition, Reading Order, and Handwritten Text.&lt;/p&gt;

&lt;p&gt;Let's start with Formula Recognition. I've uploaded an image containing complex mathematical formulas. As you can see, the model handles them exceptionally well - accurately interpreting superscripts, subscripts, and even very long, intricate expressions.&lt;br&gt;
Next up is Table Recognition. &lt;/p&gt;

&lt;p&gt;This is a notoriously difficult problem, and there are many types of tables, sometimes with borders and sometimes without, containing numerous numbers that are very easy for models to misinterpret. I used PaddleOCR-VL on several table examples and found its accuracy to be genuinely impressive.&lt;/p&gt;

&lt;p&gt;Another major challenge is understanding document Structure and Reading Order. In modern documents, content is not only more complex but also comes in highly varied layouts. Think multi-column designs, mixed text and images, folds, color printing, tilted scans, and handwritten annotations - all of which complicate OCR. The correct reading order isn't always a simple top-to-bottom, left-to-right flow.&lt;/p&gt;

&lt;p&gt;The PaddleOCR-VL technical report demonstrates how the model can understand these complex structures, almost like a human. Whether it's an academic paper, a multi-column newspaper, or a technical report, it intelligently analyzes the layout and restores a reading order that matches human intuition.&lt;/p&gt;

&lt;p&gt;Finally, PaddleOCR-VL remains extremely stable even with more complex layouts. Take this handwritten note, for example. It combines text, numbers, paragraphs, and images in a layout with left-right and top-bottom columns that typically only a human could decipher.&lt;/p&gt;
&lt;h1&gt;
  
  
  What Makes PaddleOCR VL Unique?
&lt;/h1&gt;

&lt;p&gt;PaddleOCR VL is no longer just simple text recognition, but can really "understand" the document structure. Whether it is an academic paper, a multi-column newspaper or a technical report, PaddleOCR-VL can intelligently understand the document layout and automatically organise the content in the correct order.&lt;/p&gt;

&lt;p&gt;At the same time, it accurately extracts complex content information, such as tables, mathematical formulas, handwritten notes, and chart data in documents. It converts them into structured data that can be directly used.&lt;/p&gt;

&lt;p&gt;In addition, it supports recognition of 109 languages, covering multilingual scenarios such as Chinese, English, French, Japanese, Russian, Arabic, and Spanish, greatly improving the model's recognition and processing capabilities in multilingual documents.&lt;/p&gt;
&lt;h1&gt;
  
  
  How PaddleOCR VL Is Trained
&lt;/h1&gt;

&lt;p&gt;PaddleOCR-VL consists of two parts: PP-DocLayoutV2 and PaddleOCR-VL-0.9B.&lt;/p&gt;

&lt;p&gt;Among them, the core part is PaddleOCR-VL-0.9B, which integrates a pre-trained visual encoder with a dynamic resolution preprocessor, a two-layer MLP projector, and a pre-trained large language model.&lt;br&gt;
The preprocessing technology uses native dynamic high resolution. The visual encoder uses the NaViT style encoder, which supports native resolution input.&lt;/p&gt;

&lt;p&gt;This design reduces hallucinations and improves the performance of the visual language model PaddleOCR-VL-0.9B.&lt;br&gt;
The projector efficiently connects the features of the visual encoder to the embedding space of the language model.&lt;/p&gt;
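&lt;p&gt;As a rough illustration of what such a projector does (the layer sizes below are hypothetical, not PaddleOCR-VL's actual dimensions), a two-layer MLP simply maps each visual feature vector into the language model's embedding space:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

rng = np.random.default_rng(0)
d_vision, d_hidden, d_llm = 1152, 2048, 896   # hypothetical sizes

# Two-layer MLP projector: visual-encoder features to LLM embedding space.
W1 = rng.standard_normal((d_vision, d_hidden)) * 0.02
W2 = rng.standard_normal((d_hidden, d_llm)) * 0.02

def project(vision_feats):
    h = np.maximum(vision_feats @ W1, 0.0)    # a real model would use GELU
    return h @ W2

patch_features = rng.standard_normal((64, d_vision))  # 64 visual tokens
llm_tokens = project(patch_features)
print(llm_tokens.shape)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;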

&lt;p&gt;In an autoregressive language model, the entire sequence is generated by predicting one token at a time. This means that the size of the decoder directly affects the overall inference latency, so smaller models decode faster.&lt;/p&gt;
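&lt;p&gt;That token-by-token loop can be sketched in a few lines. Each generated token costs one full forward pass through the decoder, which is exactly why a smaller decoder means lower latency (the stand-in "model" below is just for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Toy autoregressive decoding loop. `model_step` stands in for one full
# forward pass of the decoder; real latency scales with its size.
def decode(model_step, prompt_ids, max_new_tokens=5, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model_step(ids)   # one decoder forward pass per new token
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Stand-in "model" that just emits the previous token plus one.
print(decode(lambda ids: ids[-1] + 1, [1, 2, 3]))  # [1, 2, 3, 4, 5, 6, 7, 8]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;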

&lt;p&gt;Let's start coding&lt;/p&gt;

&lt;p&gt;Let us now explore step by step how to create a powerful reasoning app. We will install the libraries that support the model with a pip install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip uninstall -y torch paddlepaddle paddlepaddle-gpu
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
!pip install paddleocr paddlepaddle
!pip install langchain langchain-community langchain-openai faiss-cpu sentence-transformers openai python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: we will import the relevant libraries, whose significance will become evident as we proceed, and perform some basic configuration.&lt;/p&gt;

&lt;p&gt;PaddleOCR: converts documents and images into structured, AI-friendly data (like JSON and Markdown) with industry-leading accuracy - powering AI applications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from paddleocr import PaddleOCR
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.docstore.document import Document
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So I built this SimpleRAG system that combines PaddleOCR-VL for text extraction with OpenAI for answering queries. Let me walk you through what I developed here.&lt;/p&gt;

&lt;p&gt;In the initialisation, I set up the core components - I'm using HuggingFace's BGE embeddings for vector representations and GPT-4o as the chat model with zero temperature for consistent responses. I initialize placeholders for the vectorstore and QA chain that we'll build later.&lt;/p&gt;

&lt;p&gt;Now, for the extraction method, first I tried using the HuggingFace transformers version of PaddleOCR, which threw a weird error about image tokens not matching, then installing PaddlePaddle actually broke PyTorch (had to restart the runtime and reinstall everything in the right order), then I kept guessing at the API because the methods were deprecated and the new ones had different parameters. &lt;/p&gt;

&lt;p&gt;The real breakthrough came when I just printed out what the result object actually looked like - turns out it's just a list with one dictionary inside, and that dictionary has a key called rec_texts which is literally just a list of all the text strings that were found in the image.&lt;/p&gt;

&lt;p&gt;So instead of trying to access some complex nested object structure with .boxes.text, I just needed to check if the result was a dictionary, grab the rec_texts key, and extend my list with those strings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class SimpleRAG:
    def __init__(self):
        self.embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
        self.llm = ChatOpenAI(model="gpt-4o", temperature=0)
        self.vectorstore = None
        self.qa_chain = None
        self.ocr = PaddleOCR(use_textline_orientation=True, lang='en')

    def extract_text_from_images(self, image_paths: list):
        docs = []
        for path in image_paths:
            result = self.ocr.predict(input=path)

            text_lines = []
            for res in result:
                if isinstance(res, dict) and 'rec_texts' in res:
                    text_lines.extend(res['rec_texts'])

            text = "\n".join(text_lines) if text_lines else "No text found"
            docs.append(Document(page_content=text, metadata={'source': path}))

        return docs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In build_index, extract text from all images, split the documents into 1000-character chunks with 200-character overlap using RecursiveCharacterTextSplitter, create a FAISS vectorstore with BGE embeddings, and set up a RetrievalQA chain that uses GPT-4o and retrieves the top 3 relevant chunks per query.&lt;/p&gt;

&lt;p&gt;For a query, I just pass the question to the QA chain, which handles retrieval and generation, returning the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_index(self, image_paths: list):
        docs = self.extract_text_from_images(image_paths)

        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
        splits = text_splitter.split_documents(docs)

        self.vectorstore = FAISS.from_documents(splits, self.embeddings)
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(search_kwargs={"k": 3})
        )
def query(self, question: str):
        return self.qa_chain.invoke(question)

# Usage
rag = SimpleRAG()
rag.build_index(["Your pic"])
answer = rag.query("extract all the table?")
print(answer)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion:
&lt;/h1&gt;

&lt;p&gt;In this era of rapidly advancing AI technology, we're often bombarded with hype about "the most powerful ever" and "disruptive." However, truly valuable breakthroughs often come from innovations that solve specific problems and make technology easier to use.&lt;/p&gt;

&lt;p&gt;PaddleOCR-VL may not make mainstream headlines, but for developers who need to process documents every day, it may be the long-awaited solution.&lt;/p&gt;

&lt;p&gt;After all, the best technologies are those that are quietly integrated into daily work, making you hardly aware of their existence. PaddleOCR-VL is taking a solid step in this direction.&lt;/p&gt;

&lt;p&gt;🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 Consulting Call With Me.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you could support my work:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Support the Content (every Dollar goes back into the video): &lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>RAG is Not Dead! No Chunking, No Vectors, Just Vectorless to Get the Higher Accuracy</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Thu, 16 Oct 2025 17:43:44 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/rag-is-not-dead-no-chunking-no-vectors-just-vectorless-to-get-the-higher-accuracy-1iba</link>
      <guid>https://dev.to/gaodalie_ai/rag-is-not-dead-no-chunking-no-vectors-just-vectorless-to-get-the-higher-accuracy-1iba</guid>
      <description>&lt;p&gt;Over the past two years, I have written numerous articles on how Retrieval-Augmented Generation has become a standard feature in nearly all AI applications.&lt;/p&gt;

&lt;p&gt;Whether it's intelligent customer service, enterprise knowledge bases, financial analysis, or legal document Q&amp;amp;A, they all use the same logic: document segmentation, vectorisation, matching using cosine similarity, and then feeding the retrieved content into a large model for answering.&lt;/p&gt;

&lt;p&gt;This solution is simple and effective, but its weakness is also obvious - when the question becomes complex, spans multiple pages, or even involves multiple layers of logic, vector similarity retrieval often goes in the wrong direction. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You asked, "What will be the year-over-year change in the company's cash flow from operating activities in 2023?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A traditional RAG might find a bunch of paragraphs containing "cash flow" but miss out on key context: operating activities vs. investing activities, 2023 vs. 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The result: high similarity, but poor correlation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
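&lt;p&gt;A toy example makes this failure mode concrete. Using crude bag-of-words vectors over the terms involved (a deliberate simplification of real embeddings), the chunk about investing activities scores exactly as high as the one about operating activities, so the retriever cannot tell them apart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy vectors over the terms [cash, flow, operating, investing, 2023, 2022]
query       = [1, 1, 1, 0, 1, 0]  # "operating cash flow, 2023"
right_chunk = [1, 1, 1, 0, 0, 1]  # operating cash flow, but for 2022
wrong_chunk = [1, 1, 0, 1, 1, 0]  # cash flow from *investing* activities, 2023

print(cosine(query, right_chunk), cosine(query, wrong_chunk))  # 0.75 0.75
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both candidates look equally “similar” to the query, which is precisely the high-similarity, poor-correlation problem described above.&lt;/p&gt;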

&lt;p&gt;Many RAG systems use a technology called a vector DB, which converts text into numerical values to search for "similar" text. Still, the problem is that "similar" does not necessarily mean "the desired information."&lt;/p&gt;

&lt;p&gt;For example, a common problem occurs when a similar paragraph in a manual is hit, but important conditions or exceptions are overlooked.&lt;br&gt;
That's where PageIndex comes in. This is a new RAG mechanism devised by Vectify AI, and the idea behind it is very simple.&lt;/p&gt;

&lt;p&gt;When people read a book, they first look at the table of contents, open the chapter they are looking for, and then follow the subheadings to get to the desired location.&lt;/p&gt;

&lt;p&gt;PageIndex lets AI do exactly this, allowing you to find "truly related parts" rather than "similar sentences."&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://www.youtube.com/watch?v=97GkSYzr6yk" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will ask the chatbot a question: "What is DeepSeek-R1-Zero?" Feel free to ask any questions you want.&lt;/p&gt;

&lt;p&gt;If you look at how the chatbot generates the output, you will see that when I input a query, the agent first loads a PDF file, downloads it locally using Python's requests module, and saves it to a structured folder. It then submits the PDF to PageIndex, which builds a hierarchical tree structure of the document, organising it into natural sections and generating summaries for each node. &lt;/p&gt;

&lt;p&gt;PageIndex is a new reasoning-based, vectorless RAG framework that performs retrieval in two steps: first, it generates a tree structure index of documents, and second, it performs reasoning-based retrieval through tree search. Unlike traditional vector-based RAG systems, PageIndex does not require vectors or artificial chunking, simulates human-like navigation through the document, and provides a transparent, reasoning-based process instead of approximate semantic search.&lt;/p&gt;

&lt;p&gt;Next, I prepare a carefully crafted prompt for the LLM that includes my question and the simplified tree (with text removed to reduce size) and ask the model to identify the nodes most likely to contain the answer, returning both its reasoning and a list of node IDs in structured JSON. &lt;/p&gt;

&lt;p&gt;I then create a mapping of all nodes in the document tree, parse the LLM response, and print the model's reasoning for why it selected certain nodes. After that, I loop through the identified node IDs, retrieve their titles, page numbers, and text, and compile this content into a readable context.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Pageindex?
&lt;/h2&gt;

&lt;p&gt;PageIndex is a new method for improving RAG accuracy. In normal RAG, sentences are vectorised, and then highly similar sentences are searched for and referenced. However, this method retrieves information that is "similar in meaning but different in context," which reduces the accuracy of the answer.&lt;/p&gt;

&lt;p&gt;Therefore, PageIndex proposes a RAG that does not use a vector database.&lt;br&gt;
Specifically, the PageIndex method converts a document into a hierarchical tree structure (similar to a table of contents), and LLM searches through that structure, making it possible to understand the context and find the information you need, just like a human reading a document.&lt;/p&gt;
&lt;h1&gt;
  
  
  How does it work?
&lt;/h1&gt;

&lt;p&gt;PageIndex works in three major steps.&lt;/p&gt;
&lt;h3&gt;
  
  
  OCR (clear document reading)
&lt;/h3&gt;

&lt;p&gt;While ordinary OCR processes each page, which can lead to disorganised headings and lists, PageIndex's OCR understands the entire document as a single structure and digitises it neatly while preserving headings and tables.&lt;/p&gt;
&lt;h3&gt;
  
  
  Tree Generation (Create a table of contents tree)
&lt;/h3&gt;

&lt;p&gt;Convert documents directly into a hierarchical structure, like a table of contents. A tree structure with chapters, sections, and subsections is created, making it easy to navigate even long reports without getting lost.&lt;/p&gt;
&lt;h3&gt;
  
  
  Retrieval (searching by tracing the tree)
&lt;/h3&gt;

&lt;p&gt;The AI searches the tree based on the question and picks up all relevant parts. It also knows which pages and chapters have been visited, so the search results are well-founded.&lt;/p&gt;
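&lt;p&gt;The retrieval step boils down to a simple recursive walk over the table-of-contents tree. Here is a minimal sketch - the pick_relevant callback is a hypothetical stand-in for the LLM's relevance judgement, not PageIndex's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def search(node, question, pick_relevant):
    # Descend only into branches judged relevant, and collect the matching
    # leaf sections - like a reader following a table of contents.
    hits = []
    if pick_relevant(node["summary"], question):
        if node.get("children"):
            for child in node["children"]:
                hits.extend(search(child, question, pick_relevant))
        else:
            hits.append(node["title"])
    return hits

toc = {
    "summary": "annual report",
    "children": [
        {"summary": "cash flow from operating activities", "title": "Section 3.1"},
        {"summary": "board member biographies", "title": "Section 7.2"},
    ],
}

# Keyword matching stands in for the LLM relevance call in this sketch.
def relevant(summary, question):
    return "report" in summary or any(w in summary for w in question.split())

print(search(toc, "cash flow", relevant))  # ['Section 3.1']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the walk records which nodes it visited and why, the retrieved sections come with a traceable path through the document rather than an opaque similarity score.&lt;/p&gt;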
&lt;h2&gt;
  
  
  PageIndex Vs Conventional RAG
&lt;/h2&gt;

&lt;p&gt;Conventional RAG vectorises entire documents and stores them in a vector database. It then searches for relevant documents based on the similarity between the user's question and the content of the documents.&lt;/p&gt;

&lt;p&gt;However, this method relies only on the statistical similarity of words and sentences, so it may not always capture true relevance. &lt;/p&gt;

&lt;p&gt;Long documents are also broken into chunks, which disrupts context and hides important connections. PageIndex solves these problems by using the inherent hierarchical structure of documents without breaking them down into small pieces.&lt;/p&gt;

&lt;p&gt;This allows LLMs to retrieve information based on contextual semantic relevance rather than simple word similarity.&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's start coding 
&lt;/h2&gt;

&lt;p&gt;Let us now explore step by step how to create a vectorless RAG system. First, we will install the library that supports it with a pip install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install -q --upgrade pageindex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0.2 Setup PageIndex&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, I import the PageIndexClient class from the pageindex package and also bring in some helper functions from pageindex. Then, I generate my own API key, which I copy and paste into the variable PAGEINDEX_API_KEY.&lt;/p&gt;

&lt;p&gt;After that, I create a client instance called pi_client by passing my API key into PageIndexClient, and now I'm ready to use this client to interact with the PageIndex API for searching, indexing, or managing documents&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pageindex import PageIndexClient
import pageindex.utils as utils

# Get your PageIndex API key from https://dash.pageindex.ai/api-keys
PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY"
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0.3 Setup LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's import the openai package and set OPENAI_API_KEY to the value I got from the OpenAI dashboard; then, I define an asynchronous function called call_llm that takes a prompt, with optional parameters model (defaulting to "gpt-4.1") and temperature (defaulting to 0 for deterministic answers); inside the function, I create a new AsyncOpenAI client using my API key. &lt;/p&gt;

&lt;p&gt;Next, I call client.chat.completions.create(...) where I pass in the model name, the conversation messages (in this case, just a single user message with my prompt), and the temperature; once the response comes back, I take the first choice's message content, strip extra whitespace, and return it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import openai
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"

async def call_llm(prompt, model="gpt-4.1", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1: PageIndex Tree Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I import the os and requests modules since I need them to handle file paths and download files; then, I set pdf_url to the link of the paper I want to fetch, a PDF from arXiv; next, I build a local path called pdf_path with the filename extracted from the URL. &lt;/p&gt;

&lt;p&gt;After that, I send a GET request to download the PDF using requests.get, open a new file in write-binary mode, and save the content locally; once the file is saved, &lt;/p&gt;

&lt;p&gt;I finally use pi_client.submit_document(pdf_path) to upload the saved PDF into PageIndex, take the returned "doc_id", and print it out to confirm the document has been successfully submitted for indexing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, requests

# You can also use our GitHub repo to generate PageIndex tree
# https://github.com/VectifyAI/PageIndex

pdf_url = "https://arxiv.org/pdf/2501.12948.pdf"
pdf_path = os.path.join("../data", pdf_url.split('/')[-1])
os.makedirs(os.path.dirname(pdf_path), exist_ok=True)

response = requests.get(pdf_url)
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"Downloaded {pdf_url}")

doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print('Document Submitted:', doc_id)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1.2 Get the generated PageIndex tree structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After I've submitted the document and received a doc_id, I check whether the document is ready for retrieval by calling pi_client.is_retrieval_ready(doc_id). If it is, I call pi_client.get_tree(doc_id, node_summary=True), which gives me the structured outline of the document, and I extract the ['result'] part from the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if pi_client.is_retrieval_ready(doc_id):
    tree = pi_client.get_tree(doc_id, node_summary=True)['result']
    print('Simplified Tree Structure of the Document:')
    utils.print_tree(tree)
else:
    print("Processing document, please try again later...")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Reasoning-Based Retrieval with Tree Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Next, I import json to handle and format the document tree. I set up my query, here asking, "What are the conclusions in this document?" To simplify the tree, I remove the full text and keep only titles and summaries, using utils.remove_fields. &lt;br&gt;
I create a search_prompt that tells the LLM to identify relevant nodes and return a JSON with "thinking" and "node_list". I embed the question and the simplified tree into this prompt. Finally, I call call_llm(search_prompt) to obtain structured JSON that points to the most relevant parts of the document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

query = "What are the conclusions in this document?"

tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])

search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node ID, a node title, and a corresponding summary.
Your task is to find all nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "&amp;lt;Your thinking process on which nodes are relevant to the question&amp;gt;",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}
Directly return the final JSON structure. Do not output anything else.
"""

tree_search_result = await call_llm(search_prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2.2 Print retrieved nodes and reasoning process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Later on, I create a lookup table for the tree by calling utils.create_node_mapping(tree), which gives me a dictionary where each key is a node_id and the value is the corresponding node details; then, since my tree_search_result from the LLM is just a JSON string, I parse it into a Python dictionary with json.loads.&lt;/p&gt;

&lt;p&gt;Next, I print out the model's reasoning process by passing tree_search_result_json['thinking'] into utils.print_wrapped, which formats long text nicely so it's easier to read; after that, I loop through each node_id in tree_search_result_json["node_list"], take the matching node from my node_map, and print its ID, page number, and title, so I can clearly see which parts of the document the LLM thought were relevant to my query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_map = utils.create_node_mapping(tree)
tree_search_result_json = json.loads(tree_search_result)

print('Reasoning Process:')
utils.print_wrapped(tree_search_result_json['thinking'])

print('\nRetrieved Nodes:')
for node_id in tree_search_result_json["node_list"]:
    node = node_map[node_id]
    print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Answer Generation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;3.1 Extract relevant context from retrieved nodes&lt;br&gt;
Finally, I take the JSON string from tree_search_result and parse it again with json.loads, then take just the "node_list", which contains the IDs of relevant nodes; after that, I build a big string called relevant_content by joining together the "text" fields of each node in that list, separated by two newlines, so it reads cleanly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_list = json.loads(tree_search_result)["node_list"]
relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list)

print('Retrieved Context:\n')
utils.print_wrapped(relevant_content[:1000] + '...')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
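&lt;p&gt;The excerpt stops after extracting relevant_content. The natural final step is to feed the query and the retrieved context back into the call_llm helper defined earlier; a minimal sketch, where the prompt wording is my own:&lt;/p&gt;

```python
def build_answer_prompt(query, context):
    """Compose a grounded answering prompt from the retrieved node text."""
    return (
        "Answer the question using only the provided context.\n\n"
        f"Question: {query}\n\n"
        f"Context:\n{context}"
    )

# In the notebook, reusing call_llm and relevant_content from the steps above:
# answer = await call_llm(build_answer_prompt(query, relevant_content))
# utils.print_wrapped(answer)
```

&lt;p&gt;Because the context comes from whole tree nodes rather than fragmented chunks, the model sees coherent sections when composing its answer.&lt;/p&gt;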



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;PageIndex is a new RAG mechanism. The AI explores documents as a table-of-contents tree, ensuring high relevance and providing clear evidence. It does not require a vector database, making it suitable for on-premise environments and for searching confidential documents. PageIndex is especially effective in fields where accuracy is critical, such as contracts, technology, and finance.&lt;/p&gt;

&lt;p&gt;🧙‍♂️ I am a Generative AI expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 Consulting Call With Me.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;br&gt;
Support the Content (every dollar goes back into the video): &lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;br&gt;
Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>programming</category>
    </item>
    <item>
      <title>Long Term Memory + RAG + MCP + LangGraph = The Key To Powerful Agentic AI</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Thu, 25 Sep 2025 16:37:43 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/long-term-memory-rag-mcp-langgraph-the-key-to-powerful-agentic-ai-4k3m</link>
      <guid>https://dev.to/gaodalie_ai/long-term-memory-rag-mcp-langgraph-the-key-to-powerful-agentic-ai-4k3m</guid>
      <description>&lt;p&gt;In this story, I have a super quick tutorial showing you how to create a multi-agent chatbot using LangGraph, MCP, RAG, and long-term memory to build a powerful agent chatbot for your business or personal use.&lt;/p&gt;

&lt;p&gt;This AI agent is the most powerful one I have ever built. You can use RAG to answer questions by looking up information in dictionaries and other documents.&lt;/p&gt;

&lt;p&gt;Just as we answer difficult questions by looking up information in books or on the internet, the MCP server serves as the “hands and feet” of the AI. It uses a human analogy, even if the brain (the AI agent) thinks, “Get me that book,” the book cannot be retrieved unless the hand (MCP) actually moves. The MCP server acts as a bridge that converts the AI’s “thoughts” into actual “actions.”&lt;/p&gt;

&lt;p&gt;One of the big problems with agents is communication. It worked fine at first, but the more I used it, the worse it got. It didn’t learn from past mistakes. It kept making the same mistakes. But with this Powerful AI agent, we solved all the major pain points of the AI community.&lt;/p&gt;

&lt;p&gt;If this is your first time watching me, I highly recommend checking out my previous stories. I created a video about the Latest AI technology, which has become a big hit in the AI community.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check a &lt;a href="https://www.youtube.com/watch?v=cZsHLYAx8Zc" rel="noopener noreferrer"&gt;Video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before I ask the question, I will load a memory where I have past conversations, and then I will ask the chatbot a question: “Find the latest information about Large Language Models”&lt;/p&gt;

&lt;p&gt;If you take a look at how the agent generates the output, you'll see that the multi-AI agent system we are building uses Google's generative AI model (Gemini): a web search agent and a file operation agent work together autonomously under a manager agent that interacts with the user and issues instructions to the specialised agents.&lt;/p&gt;

&lt;p&gt;Just as humans work in teams, AI agents also work together, utilising their respective areas of expertise.&lt;/p&gt;

&lt;p&gt;The three agents featured in this system are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supervisor (Manager)&lt;/strong&gt;: The brains of the team. It understands instructions from users, plans the entire task, decides which worker should do what and when, and gives precise instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web Surfer (Worker)&lt;/strong&gt;: A professional information gatherer. Searches the web using keywords instructed by the Supervisor, gathers the necessary information, and reports it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File Operator (Worker)&lt;/strong&gt;: A master of organisation and record keeping. Follows the Supervisor’s instructions to write information to files and read from existing files.&lt;/p&gt;

&lt;p&gt;By having these agents work together, complex tasks that combine web searches and file operations, such as “find information about any product and compile it into a CSV file,” can be automatically executed with just a single user command.&lt;/p&gt;

&lt;p&gt;In this example, the tasks performed by the specialised agents are limited to web searches and file operations. Still, by increasing the number of specialised agents and assigning them personas and tools, it becomes possible to expand functionality according to the use case flexibly.&lt;/p&gt;

&lt;p&gt;For example, by utilising MCP, it is possible to implement additional agents that function as workers, allowing for the automation of more complex and practical tasks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s start coding
&lt;/h1&gt;

&lt;p&gt;Let us now explore step by step and unravel how to combine LangGraph, RAG, MCP, and long-term memory. We will install the libraries that support the model by pip installing the requirements.&lt;/p&gt;

&lt;p&gt;I would like to inform you that the code I shared here is only a part of my code. If you would like the full folder, you can find it on my Patreon. This code took me a considerable amount of time, and this agent is the most powerful and advanced agent I have built. All the techniques are in my folder.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed and perform some basic configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import streamlit as st
import json
import os
import logging
import uuid
import asyncio
import warnings
from dotenv import load_dotenv, find_dotenv
from typing import List, TypedDict

# LangChain and LangGraph components
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage, BaseMessage, ToolMessage, messages_to_dict, messages_from_dict
from langchain_core.utils.function_calling import convert_to_openai_function
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import ToolNode
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_mcp_adapters.client import MultiServerMCPClient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
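&lt;p&gt;The node and router functions shown later all read two fields from an AgentState object that this excerpt does not define. Based on how it is used (state["messages"] and state["next"]), a minimal sketch of the schema might look like this; the exact annotation is an assumption:&lt;/p&gt;

```python
from typing import List, TypedDict

class AgentState(TypedDict):
    # Shared conversation history; List[BaseMessage] in the real app.
    messages: List
    # Name of the worker to run next, or "FINISH" to stop the loop.
    next: str

# A fresh state before the supervisor has routed anything:
state: AgentState = {"messages": [], "next": "FINISH"}
```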



&lt;p&gt;Agents use “tools” to perform specific “actions” such as web searches or file operations. This system uses tools via a mechanism called Model-Context-Protocol (MCP). The mcp_config.json file is a configuration file that defines which tools to launch and how.&lt;/p&gt;

&lt;p&gt;Create a file named mcp_config.json directly under the project folder and write the following content in it.&lt;/p&gt;

&lt;p&gt;web-search: Tool server settings for performing web searches. It uses npx to start the Playwright MCP server.&lt;/p&gt;

&lt;p&gt;file-system: Tool server settings for reading and writing files.&lt;/p&gt;

&lt;p&gt;Please change the /path/to/your/project/multi-agent-system/output part of args to suit your environment. This is the absolute path of the folder where the file operator agent is allowed to read and write files. For example, create an output folder in the project and specify its path. Please note that if you specify the path incorrectly, file operations will not be possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "mcpServers": {
      "web-search": {
        "command": "npx",
        "args": [
          "@playwright/mcp@latest"
        ],
        "transport": "stdio"
      },
      "file-system": {
        "command": "npx",
        "args": [
          "-y",
          "@modelcontextprotocol/server-filesystem",
          "/path/to/your/project/multi-agent-system/output"
        ],
        "transport": "stdio"
      }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wrote sanitize_schema to walk dictionaries and lists recursively, remove unwanted keys like additionalProperties and $schema, and normalise a type field that can be a list by selecting the first non-null value and uppercasing it, applying the same cleaning to every nested value; I added save_conversation(session_id, messages), which guards against empty inputs, builds a path under CONVERSATION_HISTORY_DIR, converts message objects to plain dictionaries with messages_to_dict, and writes them as UTF-8 JSON with ensure_ascii=False and indent=2;&lt;/p&gt;

&lt;p&gt;I implemented load_conversation(session_id) to return an empty list if the file is missing, otherwise load the JSON and turn it back into message objects with messages_from_dict, returning an empty list on JSONDecodeError or TypeError to fail gracefully;&lt;/p&gt;

&lt;p&gt;I built list_conversations() to scan the directory for .json files, pull each file’s modification time, load its messages, pick the first human message that isn’t an internal instruction to use as a short title (truncated with an ellipsis at 40 characters), and collect {id, title, mtime} entries while skipping files that have errors, and finally sorted the list by modification time descending, and I added delete_conversation(session_id) to safely remove the corresponding JSON file if it exists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def sanitize_schema(item):
    """Sanitize MCP tool schema for LangChain compatibility"""
    if isinstance(item, dict):
        item.pop('additionalProperties', None)
        item.pop('$schema', None)
        if 'type' in item and isinstance(item['type'], list):
            non_null_types = [t for t in item['type'] if str(t).upper() != 'NULL']
            item['type'] = str(non_null_types[0]).upper() if non_null_types else None
        for key, value in item.items():
            item[key] = sanitize_schema(value)
    elif isinstance(item, list):
        return [sanitize_schema(i) for i in item]
    return item

def save_conversation(session_id: str, messages: List[BaseMessage]):
    """Save conversation to JSON file"""
    if not session_id or not messages:
        return
    file_path = os.path.join(CONVERSATION_HISTORY_DIR, f"{session_id}.json")
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(messages_to_dict(messages), f, ensure_ascii=False, indent=2)

def load_conversation(session_id: str) -&amp;gt; List[BaseMessage]:
    """Load conversation from JSON file"""
    file_path = os.path.join(CONVERSATION_HISTORY_DIR, f"{session_id}.json")
    if not os.path.exists(file_path):
        return []
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            data = json.load(f)
            return messages_from_dict(data)
        except (json.JSONDecodeError, TypeError):
            return []

def list_conversations() -&amp;gt; List[dict]:
    """Get list of saved conversations"""
    conversations = []
    for filename in os.listdir(CONVERSATION_HISTORY_DIR):
        if filename.endswith(".json"):
            session_id = filename[:-5]
            file_path = os.path.join(CONVERSATION_HISTORY_DIR, filename)
            try:
                mtime = os.path.getmtime(file_path)
                messages = load_conversation(session_id)
                # Get first user message as conversation title
                first_user_message = next(
                    (m.content for m in messages
                     if isinstance(m, HumanMessage) and m.additional_kwargs.get("role") != "internal_instruction"),
                    "New conversation"
                )
                title = first_user_message[:40] + "..." if len(first_user_message) &amp;gt; 40 else first_user_message
                conversations.append({"id": session_id, "title": title, "mtime": mtime})
            except Exception:
                continue
    conversations.sort(key=lambda x: x["mtime"], reverse=True)
    return conversations

def delete_conversation(session_id: str):
    """Delete conversation file"""
    file_path = os.path.join(CONVERSATION_HISTORY_DIR, f"{session_id}.json")
    if os.path.exists(file_path):
        os.remove(file_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
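&lt;p&gt;Stripped of the LangChain-specific message conversion, the save/load pattern above is just a JSON round-trip on disk. A self-contained sketch with plain dicts standing in for message objects (the function names here are illustrative, not from the original code):&lt;/p&gt;

```python
import json
import os
import tempfile

def save_history(directory, session_id, messages):
    """Write a conversation (a list of dicts) to {session_id}.json."""
    path = os.path.join(directory, f"{session_id}.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f, ensure_ascii=False, indent=2)

def load_history(directory, session_id):
    """Read a conversation back, returning [] if missing or corrupt."""
    path = os.path.join(directory, f"{session_id}.json")
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as f:
        try:
            return json.load(f)
        except json.JSONDecodeError:
            return []

with tempfile.TemporaryDirectory() as d:
    save_history(d, "demo", [{"role": "human", "content": "hello"}])
    print(load_history(d, "demo"))     # → [{'role': 'human', 'content': 'hello'}]
    print(load_history(d, "missing"))  # → []
```

&lt;p&gt;Returning an empty list on missing or corrupt files, as the original does, lets the UI fall back to a fresh conversation instead of crashing.&lt;/p&gt;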



&lt;p&gt;I made a pair of helper functions to spin up a worker agent and a supervisor agent that coordinate tasks in a multi-agent setup: I wrote create_worker to take a language model, a list of tools, and a system prompt, then build a ChatPromptTemplate consisting of a system role message plus a placeholder for past conversation history, and finally, return a pipeline that connects this prompt with the LLM bound to its tools;&lt;/p&gt;

&lt;p&gt;I then built create_supervisor to orchestrate the workers by defining a long system prompt that explains the manager’s responsibilities—analysing the user’s request, breaking it into subtasks, deciding which worker acts next, passing along prior results for continuity, finishing when complete, and retrying if a worker fails—while also dynamically listing the available workers;&lt;/p&gt;

&lt;p&gt;I created an output_schema that forces the supervisor to respond with a structured object containing a next field (worker name or FINISH) and a content field (instructions or final user response); and finally, I constructed a ChatPromptTemplate for the supervisor, then bound the LLM with this schema using bind_tools and tool_choice="supervisor_decision", returning both the prompt and the configured LLM so they can drive the agent loop together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_worker(llm: ChatGoogleGenerativeAI, tools: list, system_prompt: str):
    """Create a worker agent with specific role"""
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="messages"),
    ])
    return prompt | llm.bind_tools(tools)

def create_supervisor(llm: ChatGoogleGenerativeAI, worker_names: List[str]):
    """Create supervisor that manages tasks and directs workers"""
    system_prompt = (
        "You are the manager of an AI team. Your job is to supervise your worker team to achieve user requests.\n"
        "Carefully review the entire conversation history (user requests, workers' previous results, etc.).\n\n"
        "Follow these steps:\n"
        "1. **Task Analysis**: Consider the steps needed to fulfill the user's request. Multiple workers may need to collaborate. "
        "For example, 'WebSurfer' collects information that 'FileOperator' writes to a file.\n"
        "2. **Decide Next Action**: Based on analysis, determine the next action:\n"
        "   - **Worker Instructions**: When assigning a task to a worker, specify the worker name in `next` and detailed instructions in `content`. "
        "**Important: Include previous workers' output results in the next worker's instructions.** This enables information flow between workers.\n"
        "   - **Direct User Response**: When all tasks are complete or for simple responses not requiring workers, "
        "set `next` to 'FINISH' and provide the final response in `content`.\n"
        "   - **Recovery from Failure**: If a worker fails, review conversation history, modify instructions and retry, or try a different approach.\n\n"
        f"Available workers:\n{chr(10).join(f'- {name}' for name in worker_names)}"
    )

    output_schema = {
        "title": "supervisor_decision",
        "type": "object",
        "properties": {
            "next": {"type": "string", "description": f"Next worker name ({', '.join(worker_names)} or FINISH)"},
            "content": {"type": "string", "description": "Instructions for worker or final response to user"}
        },
        "required": ["next", "content"]
    }

    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder(variable_name="messages"),
    ])

    llm_with_tool = llm.bind_tools(tools=[output_schema], tool_choice="supervisor_decision")
    return prompt, llm_with_tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I developed a supervisor_node function to serve as the brain of the supervisor agent, guiding the workflow and recording its reasoning: I started by logging that the supervisor node is running, then built a chain by piping the supervisor_prompt into the supervisor_llm and invoked it with the current conversation history (state[”messages”]);&lt;/p&gt;

&lt;p&gt;I extracted the usage_metadata from the response to calculate and log the cost of running the supervisor model; I then pulled out the supervisor’s structured decision from the first tool call, capturing both the content (instructions or final message) and the next action (worker name or FINISH), and printed a debug statement with those values; I created an AIMessage that reflects the supervisor’s reasoning, formatting it as an instruction when directing a worker, and as plain content when finishing.&lt;/p&gt;

&lt;p&gt;If the decision wasn't FINISH, I generated an internal HumanMessage flagged with role="internal_instruction" to pass along to the worker, returning an updated state with both the supervisor's comment and the worker's instruction, along with the next action; but if the decision was FINISH, I just appended the supervisor's comment and returned the state with next="FINISH".&lt;/p&gt;

&lt;p&gt;Finally, I wrapped everything in a try/except block to catch errors, log them, and gracefully return an error AIMessage with next="FINISH", so the flow doesn't break.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def supervisor_node(state: AgentState):
        """Supervisor node that decides what to do next and records its thinking"""
        logger.info("--- Supervisor Node ---")

        try:
            chain = supervisor_prompt | supervisor_llm
            response_message = chain.invoke({"messages": state["messages"]})

            # Calculate and log costs
            usage_metadata = response_message.response_metadata.get("usage_metadata", {})
            costs = calculate_cost(usage_metadata, supervisor_model_name)
            logger.info(f"Cost (Supervisor): ${costs['total']:.6f}")

            # Extract supervisor decision
            tool_call = response_message.tool_calls[0]
            supervisor_output = tool_call['args']
            logger.info(f"Supervisor Decision: {supervisor_output}")

            content = supervisor_output.get("content", "")
            next_action = supervisor_output.get("next", "FINISH")

            print(f"DEBUG Supervisor Decision: next='{next_action}', content='{content}'")

            # Create supervisor's thinking message for UI
            supervisor_comment_content = content if next_action == "FINISH" else f"【Instruction to {next_action}】\n{content}"
            supervisor_comment = AIMessage(content=supervisor_comment_content, name="Supervisor")

            if next_action != "FINISH":
                # Internal instruction for worker
                instruction_for_worker = HumanMessage(
                    content=content,
                    additional_kwargs={"role": "internal_instruction"}
                )
                return {
                    "messages": state["messages"] + [supervisor_comment, instruction_for_worker],
                    "next": next_action
                }
            else:
                return {
                    "messages": state["messages"] + [supervisor_comment],
                    "next": next_action
                }
        except Exception as e:
            logger.error(f"Supervisor error: {e}")
            error_response = AIMessage(content=f"I encountered an error while processing your request: {str(e)}", name="Supervisor")
            return {"messages": state["messages"] + [error_response], "next": "FINISH"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I made a worker_node and its supporting routing logic to let workers execute tasks, call tools, and feed results back into the multi-agent loop with robust error handling: I began worker_node by looking up the assigned worker’s name from state[”next”], logging which worker is running, and invoking the worker with the conversation history while enforcing a recursion limit of 10; I added debug prints to show whether the worker response included tool calls and which tools were triggered; I wrapped cost calculation in a safe try/except, logging the model’s cost when usage_metadata was available and warned otherwise;&lt;/p&gt;

&lt;p&gt;I checked whether the response carried meaningful content or tool_calls, and if neither was present, I replaced it with an apologetic fallback AIMessage so the system never returns empty output; I also ensured the response carried the correct worker name and appended it to the message history in the returned state; I surrounded the entire block in a try/except so any exception is caught and turned into an error message from the worker instead of crashing.&lt;/p&gt;

&lt;p&gt;Then I created _tool_node as a ToolNode(tools) instance and wrapped it in an async custom_tool_node that executes tool calls via ainvoke and appends results back into the state; finally, I defined routing helpers: after_worker_router, which checks whether the worker's last message included tool calls and routes either to "tools" or back to the "supervisor", and supervisor_router, which inspects the supervisor's next decision and routes either to the specified worker or to END if no further action is required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def worker_node(state: AgentState):
        “”“Worker node that executes assigned tasks with error handling”“”
        worker_name = state[”next”]
        worker = workers[worker_name]
        logger.info(f”--- Worker Node: {worker_name} ---”)

        try:

            response = worker.invoke({”messages”: state[’messages’]}, {”recursion_limit”:10})

            print(f”DEBUG Worker {worker_name} response has tool_calls: {hasattr(response, ‘tool_calls’) and bool(response.tool_calls)}”)
            if hasattr(response, ‘tool_calls’) and response.tool_calls:
                print(f”DEBUG Tool calls: {[tc[’name’] for tc in response.tool_calls]}”)
            # Calculate costs safely
            try:
                usage_metadata = response.response_metadata.get(”usage_metadata”, {})
                costs = calculate_cost(usage_metadata, worker_model_name)
                logger.info(f”Cost ({worker_name}): ${costs[’total’]:.6f}”)
            except:
                logger.warning(”Could not calculate costs”)

            # Check if response has content or tool calls
            has_content = bool(response.content)
            has_tool_calls = hasattr(response, ‘tool_calls’) and bool(response.tool_calls)

            if not has_content and not has_tool_calls:
                error_message = “I apologize, but I encountered a technical issue and couldn’t complete the task. Please try rephrasing your request.”
                response = AIMessage(content=error_message, name=worker_name)

            # Ensure response has a name
            response.name = worker_name
            return {”messages”: state[”messages”] + [response]}

        except Exception as e:
            logger.error(f”Worker {worker_name} exception: {e}”)
            error_message = “I encountered an error while processing your request. Please try again or rephrase your question.”
            error_response = AIMessage(content=error_message, name=worker_name)
            return {”messages”: state[”messages”] + [error_response]}

    # Tool execution node
    _tool_node = ToolNode(tools)

    async def custom_tool_node(state: AgentState):
        “”“Node that executes tools called by workers”“”
        tool_results = await _tool_node.ainvoke(state)
        return {”messages”: state[”messages”] + tool_results[”messages”]}

    # --- Routing Functions ---

    def after_worker_router(state: AgentState) -&amp;gt; str:
        “”“Router that decides where to go after worker execution”“”
        last_message = state[”messages”][-1]
        if hasattr(last_message, “tool_calls”) and last_message.tool_calls:
            return “tools”
        return “supervisor”

    def supervisor_router(state: AgentState) -&amp;gt; str:
        “”“Router that decides where to go after supervisor decision”“”
        next_val = state.get(”next”)
        if not next_val or next_val == “FINISH”:
            return END
        return next_val
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I made a workflow orchestration graph that connects the supervisor, workers, and tools into a single state machine: I started by initialising a StateGraph with the AgentState type, then added nodes for the supervisor, the tools, and each worker dynamically by looping over the workers dictionary;&lt;/p&gt;

&lt;p&gt;I set up conditional edges for each worker using after_worker_router, so that after completing a task, the flow either routes to “tools” if tool calls are present or back to “supervisor” otherwise; I defined a direct edge from “tools” back to “supervisor” to ensure tool results are always reviewed, and I then configured the supervisor’s routing with supervisor_router,&lt;/p&gt;

&lt;p&gt;so its decisions can branch to specific workers or end the workflow when tasks are complete; I marked the supervisor as the entry point by adding an edge from START to “supervisor”, ensuring all requests begin under its control; finally, I compiled the workflow with a MemorySaver checkpointer to persist conversation state across steps, returned the resulting app, and logged that graph initialisation had completed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;workflow = StateGraph(AgentState)

    # Add nodes
    workflow.add_node(”supervisor”, supervisor_node)
    workflow.add_node(”tools”, custom_tool_node)
    for name in workers:
        workflow.add_node(name, worker_node)

    # Add conditional edges for workers
    for name in workers:
        workflow.add_conditional_edges(
            name,
            after_worker_router,
            {”tools”: “tools”, “supervisor”: “supervisor”}
        )

    # Tools always return to supervisor
    workflow.add_edge(”tools”, “supervisor”)

    # Supervisor conditional routing
    workflow.add_conditional_edges(
        “supervisor”,
        supervisor_router,
        {**{name: name for name in workers}, END: END}
    )

    # Start with supervisor
    workflow.add_edge(START, “supervisor”)

    # Compile with memory
    memory = MemorySaver()
    app = workflow.compile(checkpointer=memory)

    logger.info(”Graph initialization completed.”)
    return app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion :
&lt;/h1&gt;

&lt;p&gt;The combination of AI agents is expected to change the way we work and conduct business dramatically. The role of AI will change dramatically from its current role as a “teacher” to a “reliable partner,” and even “someone who acts on our behalf.”&lt;/p&gt;

&lt;p&gt;The following abilities are considered particularly important for making effective use of AI agents:&lt;/p&gt;

&lt;p&gt;“The power to ask questions”: The ability to define problems and give clear instructions and requirements to AI.&lt;/p&gt;

&lt;p&gt;“Ability to confirm and decide”: The ability to evaluate the results generated by AI and make final decisions.&lt;/p&gt;

&lt;p&gt;“Ability to assign work through multitasking”: The ability to appropriately use multiple AI agents and allocate tasks efficiently.&lt;/p&gt;

&lt;p&gt;🧙‍♂️ I am an AI Generative expert! If you want to collaborate on a project, drop an inquiry here or book a 1-on-1 Consulting Call With Me.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Support the Content (every Dollar goes back into the -video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Subscribe to the Newsletter for free:&lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I built Nano Banana AI Image Editing Agent</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Thu, 04 Sep 2025 15:50:32 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/how-i-built-nano-banana-ai-image-editing-agent-3ec1</link>
      <guid>https://dev.to/gaodalie_ai/how-i-built-nano-banana-ai-image-editing-agent-3ec1</guid>
      <description>&lt;p&gt;Recently, I’ve been working on a personal development project to create a service that handles multiple images, and I wanted to make the image generation workflow smoother. It’s a pain to have to generate images in a separate tool, download them, and then incorporate them into the code.&lt;/p&gt;

&lt;p&gt;Suddenly, a new wind is blowing in the world of image generation AI called “nano-banana”. An unidentified image-generating AI suddenly appeared on a comparison site for AI models called LMArena.&lt;/p&gt;

&lt;p&gt;With no official announcements, it remains shrouded in mystery, but its high level of accuracy has caused quite a stir in the AI community.&lt;/p&gt;

&lt;p&gt;In the world of image generation AI, well-known models such as DALL-E, Midjourney, and Stable Diffusion have long dominated the market. However, the emergence of nano-banana is about to change this landscape dramatically.&lt;/p&gt;

&lt;p&gt;What’s particularly noteworthy about nano-banana is its consistency and editing capabilities. It effectively maintains character across multiple images and handles complex image editing, a task that previous models struggled to achieve.&lt;/p&gt;

&lt;p&gt;When I actually used it, I found that although the prompts needed some ingenuity, it generated images that suited my usage scenarios with a fairly high degree of accuracy. So, I thought it would be useful to incorporate this into my development environment&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check &lt;a href="https://www.youtube.com/watch?v=0A4pBdHMYNM&amp;amp;list=PLe0lDFxNR_cgBus1ScFFWFU3sH_oMQ0vp" rel="noopener noreferrer"&gt;video&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;If you take a look at how the chatbot generates the output, you’ll see that the AI agent captures the user input and adds it to the session state message history, then constructs an API request by combining the system prompt (which instructs the agent to generate images and describe changes), the user’s text, and any uploaded reference image.&lt;/p&gt;

&lt;p&gt;This request is sent to Google’s Gemini 2.5 Flash Image Preview model, configured to return both text descriptions and generated images. The model processes the prompt to extract visual concepts, objects, and styles, then generates visual content by transforming the instructions into a coherent image based on both the text prompt and any visual inputs.&lt;/p&gt;

&lt;p&gt;It returns a structured response where the unpack_response function extracts text into full_response and converts binary image data into a PIL Image object, which is then displayed in the Streamlit chat interface alongside the generated description and stored in the session state message history, creating a persistent conversational record where users can reference previous generations and build upon them iteratively&lt;/p&gt;

&lt;h1&gt;
  
  
  Why this is a game changer
&lt;/h1&gt;

&lt;p&gt;While conventional AI image generation has tended to rely on a “try a few times to find the right one” approach, Gemini 2.5 (nano-banana) excels at “maintaining consistency even when the composition changes” and “finishing the image by correcting only the targeted areas.”&lt;/p&gt;

&lt;p&gt;Localised editing, compositing multiple images, copying styles, and maintaining consistency of subjects can all be done on the same model, dramatically reducing the number of trials.&lt;/p&gt;

&lt;p&gt;By reducing trial and error on the original drawings and rough sketches in the pre-production stage and producing a more refined product, the need for rework in the post-production stage can be reduced.&lt;/p&gt;

&lt;p&gt;In the workplace, the chances of a plan remaining just a plan will decrease, and the option of actually creating it will become more realistic. Also, depending on the product, a dramatic increase in production speed is expected.&lt;/p&gt;

&lt;h1&gt;
  
  
  Features
&lt;/h1&gt;

&lt;p&gt;Gemini 2.5 Flash Image (formerly nano-banana) is the latest image generation and editing model developed by Google. This model can perform advanced image generation and editing using only text instructions, and also supports editing and compositing of existing images.&lt;/p&gt;

&lt;h3&gt;
  
  
  High-speed image generation:
&lt;/h3&gt;

&lt;p&gt;Images can be generated in just a few seconds per image, significantly faster than competing models. It is also cost-effective.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designed specifically for image editing:
&lt;/h3&gt;

&lt;p&gt;This powerful app lets you change backgrounds and people’s expressions simply by sending text commands. It also supports editing tasks like blurring backgrounds, erasing people, changing poses, and colorizing black-and-white photos. It also faithfully responds to multi-step commands (multiple times) within the same chat session.&lt;/p&gt;
&lt;h3&gt;
  
  
  Maintaining character consistency:
&lt;/h3&gt;

&lt;p&gt;Maintains facial features, body shapes, clothing, etc., of people and characters with high accuracy. Effective for generating and editing a series of images.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fusion and composition of multiple images:
&lt;/h3&gt;

&lt;p&gt;It is possible to combine an input image with another image scene or to create a new fused image by combining elements of multiple images.&lt;/p&gt;
&lt;h3&gt;
  
  
  Gemini Knowledge Integration:
&lt;/h3&gt;

&lt;p&gt;Leveraging the world knowledge and logical inference capabilities of Google’s large-scale language model “Gemini,” the system generates images with semantic consistency. It also demonstrates excellent performance in accurately reproducing text and logos, expressing factual details, and reading diagrams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Digital watermark embedding:
&lt;/h3&gt;

&lt;p&gt;A digital watermark using SynthID is automatically embedded in the output image, making it possible to later identify it as an AI-generated image.&lt;/p&gt;
&lt;h1&gt;
  
  
  Let’s Start Coding
&lt;/h1&gt;

&lt;p&gt;Let us now explore step by step and unravel the answer to how to build a Nano Banana AI Image Editing Agent. We will install the libraries that support the model by installing the requirements file.&lt;/p&gt;

&lt;p&gt;pip install -r requirements.txt&lt;br&gt;
Once installed, we import the important dependencies like streamlit, google, io and PIL&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import streamlit as st
from io import StringIO
from dotenv import load_dotenv

import os
from io import BytesIO
from PIL import Image
from google import genai
from google.genai import types
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I set the page configuration with a custom title and sidebar, made a dictionary to hold avatars for the assistant and the user so their messages look unique, and created styled headings on the main page using HTML with custom colors.&lt;/p&gt;

&lt;p&gt;I also added a sidebar with a banana emoji title for fun, and I initialized the session state so the chatbot remembers past messages, starting with a default greeting from the assistant.&lt;/p&gt;

&lt;p&gt;Then, I created a loop that displays each stored message with the right avatar and content, and if the assistant sends an image, I made sure it is displayed directly under the message, giving the app an interactive and conversational feel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.set_page_config(page_title='Gemini Nano Banana Chatbot', 
                    initial_sidebar_state='auto')

background_color = "#252740"

avatars = {
    "assistant": "🤖",
    "user": "👤"
}

st.markdown("&amp;lt;h2 style='text-align: center; color: #3184a0;'&amp;gt;Gemini Nano Banana&amp;lt;/h2&amp;gt;", unsafe_allow_html=True)
st.markdown("&amp;lt;h3 style='text-align: center; color: #3184a0;'&amp;gt;Image generator chatbot&amp;lt;/h3&amp;gt;", unsafe_allow_html=True)

with st.sidebar:
    st.markdown("### 🍌 Gemini Nano Banana")

if "messages" not in st.session_state.keys():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?", "image": None}
    ]

for message in st.session_state.messages:
    with st.chat_message(message["role"], 
                         avatar=avatars[message["role"]]):
        st.write(message["content"])
        if message["role"] == "assistant" and message["image"]:
            st.image(message["image"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I developed a function called clear_chat_history that resets the conversation by replacing the session state messages with a single default assistant greeting, and then I connected this function to a "Clear Chat History" button in the sidebar so users can restart the chat whenever they want.&lt;/p&gt;

&lt;p&gt;I also added a file uploader inside the sidebar that lets users upload images in JPG, JPEG, or PNG format, and once an image is uploaded, I made sure it gets opened with Image.open, saved into the session state for later use, and immediately displayed in the sidebar with a caption so users can see the image they just uploaded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def clear_chat_history():
    st.session_state.messages = [
        {"role": "assistant", "content": "How may I assist you today?", "image": None}
    ]

st.sidebar.button("Clear Chat History", on_click=clear_chat_history)

with st.sidebar:
    uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

    if uploaded_file:
        image_bytes = Image.open(uploaded_file)
        st.session_state.image = image_bytes
        st.image(image_bytes, caption="Uploaded Image", use_container_width=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, I created a function run_query that lets the Agent send a request to Google’s Gemini API to generate text and images from what the user inputs. I started by loading the environment variables to safely get the key GEMINI_API_KEY and then set up the API client with that key.&lt;/p&gt;

&lt;p&gt;I wrote a system prompt that clearly tells the model to generate an image and a short text describing any changes. Then I put together a contents list that includes the user’s input and the uploaded image, if there is one, or just the input if not. I called client.models.generate_content using the "gemini-2.5-flash-image-preview" model, set it to return both text and images, and finally, made the function return the model’s response or "Error" if something goes wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def run_query(input_text):
    try:
        load_dotenv()
        GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
        os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY
        client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
        system_prompt = """
        #INSTRUCTIONS
        Generate an image according to the instructions. 
        Specify in the output text the changes made to the image
        #OUTPUT
        A generated image and a short text
        """

        if "image" in st.session_state and st.session_state.image:
            contents = [system_prompt, input_text, st.session_state.image]
        else:
            contents = [system_prompt, input_text]

        response = client.models.generate_content(
            model="gemini-2.5-flash-image-preview",
            contents=contents,
            config=types.GenerateContentConfig(
                response_modalities=['Text', 'Image']
            )
        )

        if response:
            return response
        else:
            return "Error"
    except Exception as e:
        # Surface failures as a string so the caller can detect "Error"
        return f"Error: {e}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, I built a function called unpack_response that takes what the user types, sends it to the Gemini model, and then separates the text and image that the model creates. I set up a placeholder so we could update the output dynamically, started with an empty string for the text, and created a variable to hold the image.&lt;/p&gt;

&lt;p&gt;If something goes wrong, the function returns an error message, but normally it loops through the response: any text is added to the response string, and any image data is opened so it can be shown. To make the chat feel real,&lt;/p&gt;

&lt;p&gt;I used st.chat_input so users can type messages, display their message with a 👤 avatar, then show the assistant’s reply with a loading spinner, including both the text and image if the model generated one. Finally, I saved the assistant’s reply in the session state so the whole conversation stays visible and interactive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def unpack_response(prompt):
    response = run_query(prompt)

    placeholder = st.empty()
    full_response = ""
    generated_image = None

    # Handle error responses
    if isinstance(response, str) and "Error" in response:
        return response, placeholder, None

    try:
        for part in response.candidates[0].content.parts:
            if part.text is not None:
                full_response += part.text
            elif part.inline_data is not None:
                generated_image = Image.open(BytesIO(part.inline_data.data))
    except Exception as ex:
        full_response = f"ERROR in unpack response: {str(ex)}"
        generated_image = st.session_state.image if "image" in st.session_state else None

    return full_response, placeholder, generated_image

output = st.empty()
if prompt := st.chat_input():
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user", avatar=avatars["user"]):
        st.write(prompt)

if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant", avatar=avatars["assistant"]):
        with st.spinner("Thinking..."):

            full_response, placeholder, generated_image = unpack_response(prompt)
            if full_response:
                st.write(full_response)
            if generated_image:
                st.image(generated_image)

    message = {"role": "assistant", 
               "content": full_response,
               "avatar": avatars["assistant"],
               "image": generated_image}
    st.session_state.messages.append(message)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion :
&lt;/h1&gt;

&lt;p&gt;The arrival of nano-banana marks a major turning point in the image generation AI industry, with its overwhelming performance and innovative features opening up new possibilities for the creative industry.&lt;/p&gt;

&lt;p&gt;Gemini 2.5 (nano-banana) is not just an AI that helps you create things better, but has the potential to change the production process itself. It would be a good idea to think about business on the assumption that such functions will continue to develop and allow you to achieve your goals more perfectly.&lt;/p&gt;

&lt;p&gt;🧙‍♂️ I am an AI Generative expert! If you want to collaborate on a project, drop an inquiry here or Book a 1-on-1 Consulting Call With Me.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;br&gt;
Support the Content (every Dollar goes back into the -video):&lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;br&gt;
Subscribe to the Newsletter for free:&lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>LangExtract + Knowledge Graph— Google’s New Library for NLP Tasks</title>
      <dc:creator>Gao Dalie (Ilyass)</dc:creator>
      <pubDate>Fri, 22 Aug 2025 04:29:29 +0000</pubDate>
      <link>https://dev.to/gaodalie_ai/langextract-knowledge-graph-googles-new-library-for-nlp-tasks-kd6</link>
      <guid>https://dev.to/gaodalie_ai/langextract-knowledge-graph-googles-new-library-for-nlp-tasks-kd6</guid>
      <description>&lt;p&gt;In this story, I have a super quick tutorial showing you how to create a Knowledge graph and LangExtract to build a powerful chatbot for your business or personal use.&lt;/p&gt;

&lt;p&gt;In today’s data-driven world, much valuable information is hidden in unstructured text — for example, clinical records, lengthy legal contracts, or user feedback threads. Extracting meaningful and traceable information from these documents has always been a dual challenge, both technically and practically.&lt;/p&gt;

&lt;p&gt;On July 30, 2025, Google released the open-source AI program LangExtract. This tool accurately extracts only the necessary information from the types of text we read every day, such as emails, reports, and medical records, and organises it into a format that is easy for computers to process.&lt;/p&gt;

&lt;p&gt;While AI is very useful, it also has weaknesses, such as generating false stories (hallucinations), providing incorrect information, having a limited amount of information it can retain at one time, and sometimes giving different answers each time.&lt;/p&gt;

&lt;p&gt;LangExtract was created as a “smart bridge” to compensate for these weaknesses of AI and transform AI’s ability to understand text into the ability to extract reliable information.&lt;/p&gt;

&lt;p&gt;So, let me give you a quick demo of a live chatbot to show you what I mean.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://www.youtube.com/watch?v=r1WJiAtCYj0&amp;amp;t=44s" rel="noopener noreferrer"&gt;video&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will ask the chatbot a question: “Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until he died in 2011.”&lt;/p&gt;

&lt;p&gt;If you take a look at how the Agent generates the output, you’ll see that the agent extracts entities using the document_extractor_tool, which leverages LangExtract with dynamic few-shot learning examples that automatically select appropriate extraction templates based on query keywords: when the system detects keywords like “financial,” “revenue,” or “company,” it applies business-focused examples that properly classify entities as company names, people, locations, and dates rather than generic categories.&lt;/p&gt;

&lt;p&gt;The entity extraction process runs in parallel with relationship extraction, where the system identifies connections between entities such as “founded by,” “headquartered in,” and “competes with” relationships by analyzing the contextual information within each document.&lt;/p&gt;

&lt;p&gt;Once both entities and relationships are extracted, the build_graph_data function constructs a graph structure, creating nodes for each unique entity and edges for each discovered relationship, with a robust fallback mechanism that ensures connectivity by creating "related_to" edges between all entities when explicit relationships aren't found.&lt;/p&gt;

&lt;p&gt;The final visualization layer uses Streamlit Agraph to render an interactive knowledge graph where users can explore the connections between companies, founders, locations, and other business entities. The entire system operates in-memory without file operations and provides real-time debugging information showing the number of entities and relationships discovered, ultimately enabling users to query the knowledge graph and receive filtered results based on their specific questions about the technology companies and their interconnections.&lt;/p&gt;
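&lt;p&gt;The node-and-edge construction described above can be sketched with plain dictionaries. This is a hypothetical simplification (the actual app builds streamlit_agraph Node and Edge objects, and the function name build_graph_data is taken from the prose): nodes come from unique entities, edges from extracted relationships, with a “related_to” fallback when no explicit relationships were found.&lt;/p&gt;

```python
def build_graph_data(entities, relationships):
    """Build node/edge dicts; fall back to generic edges for connectivity."""
    # One node per unique entity, preserving first-seen order
    nodes = [{"id": e, "label": e} for e in dict.fromkeys(entities)]
    # One edge per extracted (source, relation, target) triple
    edges = [{"source": s, "target": t, "label": rel}
             for (s, rel, t) in relationships]
    if not edges:
        # Fallback: connect every entity pair so the graph is never empty
        names = [n["id"] for n in nodes]
        edges = [{"source": a, "target": b, "label": "related_to"}
                 for i, a in enumerate(names) for b in names[i + 1:]]
    return nodes, edges
```

&lt;p&gt;The real implementation would then feed these into agraph(nodes=..., edges=..., config=...) for rendering.&lt;/p&gt;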

&lt;h1&gt;
  
  
  What is LangExtract?
&lt;/h1&gt;

&lt;p&gt;LangExtract is Google’s latest publicly available open-source library, one that might finally bring sanity back to developers and data teams.&lt;/p&gt;

&lt;p&gt;This tool doesn’t just “use AI to extract information.” It combines each extraction with the original text. LangExtract acts as a “special mechanism” built on top of LLM to maximise its capabilities by addressing challenges AI faces in information extraction, such as hallucination, imprecision, limited context windows, and nondeterminism.&lt;/p&gt;

&lt;h1&gt;
  
  
  What’s special about LangExtract?
&lt;/h1&gt;

&lt;p&gt;The core strength of LangExtract lies in its “programmatic extraction” capability: it not only identifies the required information precisely but also links each extracted result to the exact character position (offset) in the original text. This traceability allows users to interactively highlight and verify results, significantly improving data reliability.&lt;/p&gt;
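&lt;p&gt;To make the offset idea concrete, here is a toy helper, not part of the LangExtract API, that illustrates what source grounding means: mapping each extracted snippet back to its exact character span in the original text.&lt;/p&gt;

```python
def attach_offsets(source_text, extraction_texts):
    """Map each extracted snippet to its (start, end) character span."""
    spans = {}
    for snippet in extraction_texts:
        idx = source_text.find(snippet)
        if idx != -1:  # only record snippets actually present in the source
            spans[snippet] = (idx, idx + len(snippet))
    return spans

# Ground two extractions in the original sentence
print(attach_offsets("Apple was founded in 1976", ["Apple", "1976"]))
```

&lt;p&gt;LangExtract records this kind of span information with each extraction, which is what enables the highlight-and-verify workflow described above.&lt;/p&gt;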

&lt;p&gt;LangExtract comes with a range of powerful features: it can process long documents with millions of tokens efficiently through chunking, parallel computation, and multi-pass extraction to ensure high recall. It produces structured outputs directly, eliminating the need for traditional RAG workflows such as chunking and embeddings.&lt;/p&gt;

&lt;p&gt;It is also compatible with both cloud-based models (like Gemini) and local open-source large models, making it highly adaptable. In addition, it supports custom prompt templates, allowing easy adaptation to different domains.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s start coding
&lt;/h1&gt;

&lt;p&gt;Let us now explore step by step and unravel the answer to how to create a graph with the LangExtract chatbot. We will install the libraries that support the model. For this, we will do a pip install of the requirements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.&lt;/p&gt;

&lt;p&gt;langextract is a Python library that uses LLMs to extract structured information from unstructured text documents based on user-defined instructions.&lt;/p&gt;

&lt;p&gt;streamlit_agraph is a custom component for the Streamlit framework, designed specifically for creating interactive graphs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import textwrap
import langextract as lx
import logging
import streamlit as st
from streamlit_agraph import Config, Edge, Node, agraph
from typing import List, Dict, Any, Optional
import json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s create the function document_extractor_tool that takes two strings: the unstructured_text and the user_query. The function returns a Python dictionary, making it easy to convert into JSON later. Inside, we first build a clean prompt with textwrap.dedent(...) that tells the model its role (an expert extractor), the task (pull out relevant info), and the specific query to focus on.&lt;/p&gt;

&lt;p&gt;Next, we prepare “few-shot” examples to guide the extractor. Based on the query, we check for keywords: if it is financial, we provide a company/revenue example; if it is legal, a contract example; if it is social/restaurant, a feedback example; otherwise, a generic Romeo/Juliet example. These short examples demonstrate how the model should structure the extractions and make the expected output format clear.&lt;/p&gt;

&lt;p&gt;Finally, we call lx.extract(...), passing the text, prompt, examples, and an API key stored safely in an environment variable. We log the results for debugging, then normalise the output so each extraction is a plain dictionary with "text", "class", and "attributes".&lt;/p&gt;

&lt;p&gt;The function returns a single dictionary containing all extracted data in a clean, structured format, ready to be saved, printed, or sent to another system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def document_extractor_tool(unstructured_text: str, user_query: str) -&amp;gt; dict:
    """
    Extracts structured information from a given unstructured text based on a user's query.
    """
    prompt = textwrap.dedent(f"""
    You are an expert at extracting specific information from documents.
    Based on the user's query, extract the relevant information from the provided text.
    The user's query is: "{user_query}"
    Provide the output in a structured JSON format.
    """)

    # Dynamic Few-Shot Example Selection
    examples = []
    query_lower = user_query.lower()
    if any(keyword in query_lower for keyword in ["financial", "revenue", "company", "fiscal"]):
        financial_example = lx.data.ExampleData(
            text="In Q1 2023, Innovate Inc. reported a revenue of $15 million.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="company_name",
                    extraction_text="Innovate Inc.",
                    attributes={"name": "Innovate Inc."},
                ),
                lx.data.Extraction(
                    extraction_class="revenue",
                    extraction_text="$15 million",
                    attributes={"value": 15000000, "currency": "USD"},
                ),
                lx.data.Extraction(
                    extraction_class="fiscal_period",
                    extraction_text="Q1 2023",
                    attributes={"period": "Q1 2023"},
                ),
            ]
        )
        examples.append(financial_example)
    elif any(keyword in query_lower for keyword in ["legal", "agreement", "parties", "effective date"]):
        legal_example = lx.data.ExampleData(
            text="This agreement is between John Doe and Jane Smith, effective 2024-01-01.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="John Doe",
                    attributes={"name": "John Doe"},
                ),
                lx.data.Extraction(
                    extraction_class="party",
                    extraction_text="Jane Smith",
                    attributes={"name": "Jane Smith"},
                ),
                lx.data.Extraction(
                    extraction_class="effective_date",
                    extraction_text="2024-01-01",
                    attributes={"date": "2024-01-01"},
                ),
            ]
        )
        examples.append(legal_example)
    elif any(keyword in query_lower for keyword in ["social", "post", "feedback", "restaurant", "菜式", "評價"]):
        social_media_example = lx.data.ExampleData(
            text="I tried the new 'Taste Lover' restaurant in TST today. The black truffle risotto was amazing, but the Tiramisu was just average.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="restaurant_name",
                    extraction_text="Taste Lover",
                    attributes={"name": "Taste Lover"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="black truffle risotto",
                    attributes={"name": "black truffle risotto", "sentiment": "positive"},
                ),
                lx.data.Extraction(
                    extraction_class="dish",
                    extraction_text="Tiramisu",
                    attributes={"name": "Tiramisu", "sentiment": "neutral"},
                ),
            ]
        )
        examples.append(social_media_example)
    else:
        # Default generic example if no specific keywords match
        generic_example = lx.data.ExampleData(
            text="Juliet looked at Romeo with a sense of longing.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Juliet", attributes={"name": "Juliet"}
                ),
                lx.data.Extraction(
                    extraction_class="character", extraction_text="Romeo", attributes={"name": "Romeo"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion", extraction_text="longing", attributes={"type": "longing"}
                ),
            ]
        )
        examples.append(generic_example)

    logging.info(f"Selected {len(examples)} few-shot example(s).")

    result = lx.extract(
        text_or_documents=unstructured_text,
        prompt_description=prompt,
        examples=examples,
        api_key=os.getenv("GOOGLE_API_KEY")
    )

    logging.info(f"Extraction result: {result}")

    # Convert the result to a JSON-serializable format
    extractions = [
        {"text": e.extraction_text, "class": e.extraction_class, "attributes": e.attributes}
        for e in result.extractions
    ]

    return {
        "extracted_data": extractions
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
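
&lt;p&gt;The keyword routing above is plain Python, so it can be tested without calling the model. Here is a minimal sketch of just the selection step; the category names are illustrative and not part of the langextract API:&lt;/p&gt;

```python
def select_example_category(user_query: str) -> str:
    """Route a query to a few-shot example set by keyword, mirroring the tool above."""
    query_lower = user_query.lower()
    if any(kw in query_lower for kw in ["financial", "revenue", "company", "fiscal"]):
        return "financial"
    if any(kw in query_lower for kw in ["legal", "agreement", "parties", "effective date"]):
        return "legal"
    if any(kw in query_lower for kw in ["social", "post", "feedback", "restaurant"]):
        return "social"
    return "generic"
```

&lt;p&gt;Because the first matching branch wins, a query mentioning both “revenue” and “agreement” gets the financial example, just as in the function above.&lt;/p&gt;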



&lt;p&gt;Then we define the function load_gemini_key(), which returns a tuple with two things: the key itself (str) and a flag (bool) that tells you whether the key is available. At the start, it sets key to an empty string and is_key_provided to False.&lt;/p&gt;

&lt;p&gt;Then it checks if a file called .streamlit/secrets.toml exists and if it contains "GOOGLE_API_KEY". If yes, it pulls the key from there, shows a green success message in the sidebar saying it’s using the secrets file, and sets the flag to True.&lt;/p&gt;

&lt;p&gt;If the key isn’t found in the secrets file, it falls back to asking the user directly. In the sidebar, it shows a password-style text input box where the user can paste their Gemini API key.&lt;/p&gt;

&lt;p&gt;If the user enters something, it displays another green success message and sets the flag to True. If they leave it empty, it shows a red error message saying there’s no key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Streamlit utility functions
def load_gemini_key() -&amp;gt; tuple[str, bool]:
    """Load the Gemini API key from Streamlit secrets or user input."""
    key = ""
    is_key_provided = False
    secrets_file = os.path.join(".streamlit", "secrets.toml")
    if os.path.exists(secrets_file) and "GOOGLE_API_KEY" in st.secrets.keys():
        key = st.secrets["GOOGLE_API_KEY"]
        st.sidebar.success('Using Gemini Key from secrets.toml')
        is_key_provided = True
    else:
        key = st.sidebar.text_input(
            'Add Gemini API key and press \'Enter\'', type="password")
        if len(key) &amp;gt; 0:
            st.sidebar.success('Using the provided Gemini Key')
            is_key_provided = True
        else:
            st.sidebar.error('No Gemini Key')
    return key, is_key_provided
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
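
&lt;p&gt;The same secrets-then-input fallback can be expressed without Streamlit, which makes the precedence easy to test. This is a sketch under the assumption that secrets is a plain dict and ask_user is any callable standing in for the sidebar input:&lt;/p&gt;

```python
def resolve_gemini_key(secrets: dict, ask_user) -> tuple:
    """Return (key, is_key_provided), preferring secrets over interactive input."""
    if "GOOGLE_API_KEY" in secrets:
        return secrets["GOOGLE_API_KEY"], True
    key = ask_user()  # stands in for the password-style sidebar input
    if len(key) > 0:
        return key, True
    return "", False
```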



&lt;p&gt;Next, we create format_output_agraph(output), which takes a dictionary with "nodes" and "edges" and converts each node into a Node object (with id, label, size, and shape) and each edge into an Edge object (with source, target, label, colour, and arrows), returning two lists ready for visualisation.&lt;/p&gt;

&lt;p&gt;We then create display_agraph(nodes, edges), which sets up the graph’s appearance and behaviour with a Config object, controlling width, height, directed layout, physics simulation, hierarchical layout, highlight colour, collapsibility, and which property to use as the node label.&lt;/p&gt;

&lt;p&gt;Finally, it calls agraph() with the nodes, edges, and config to render the graph in the Streamlit app, providing a simple pipeline from raw graph data to an interactive, styled visualisation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def format_output_agraph(output):
    nodes = []
    edges = []
    for node in output["nodes"]:
        nodes.append(
            Node(id=node["id"], label=node["label"], size=8, shape="diamond"))
    for edge in output["edges"]:
        edges.append(Edge(source=edge["source"], label=edge["relation"],
                     target=edge["target"], color="#4CAF50", arrows="to"))
    return nodes, edges

def display_agraph(nodes, edges):
    config = Config(width=950,
                    height=950,
                    directed=True,
                    physics=True,
                    hierarchical=True,
                    nodeHighlightBehavior=False,
                    highlightColor="#F7A7A6",
                    collapsible=False,
                    node={'labelProperty': 'label'},
                    )
    return agraph(nodes=nodes, edges=edges, config=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
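
&lt;p&gt;Stripped of the streamlit_agraph classes, the conversion is just a pair of list comprehensions. In this dependency-free sketch, plain dicts stand in for the Node and Edge objects so the mapping itself can be verified:&lt;/p&gt;

```python
def format_output_plain(output: dict) -> tuple:
    """Mirror format_output_agraph, using plain dicts instead of agraph objects."""
    nodes = [{"id": n["id"], "label": n["label"], "size": 8, "shape": "diamond"}
             for n in output["nodes"]]
    edges = [{"source": e["source"], "target": e["target"],
              "label": e["relation"], "color": "#4CAF50", "arrows": "to"}
             for e in output["edges"]]
    return nodes, edges
```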



&lt;p&gt;After that, we develop the extract_entities(documents) function, which loops through each document and calls document_extractor_tool with a query to extract financial entities like company names, revenue figures, and fiscal periods, collecting all results into a single list.&lt;/p&gt;

&lt;p&gt;Similarly, extract_relationships(documents) processes each document to extract connections and relationships between these entities, such as revenue links between companies and fiscal periods, again aggregating all results into a list.&lt;/p&gt;

&lt;p&gt;Together, they convert raw text documents into structured entity and relationship data that can later be used to build a graph or knowledge network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Core GraphRAG functions
def extract_entities(documents: List[str]) -&amp;gt; List[Dict[str, Any]]:
    """Extract entities from documents"""
    all_entities = []

    for doc in documents:
        result = document_extractor_tool(
            doc, 
            "Extract financial entities including company names, revenue figures, and fiscal periods from business documents"
        )
        all_entities.extend(result["extracted_data"])

    return all_entities

def extract_relationships(documents: List[str]) -&amp;gt; List[Dict[str, Any]]:
    """Extract relationships between entities"""
    all_relationships = []

    for doc in documents:
        result = document_extractor_tool(
            doc,
            "Extract financial relationships and revenue connections between companies and fiscal periods"
        )
        all_relationships.extend(result["extracted_data"])

    return all_relationships
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
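
&lt;p&gt;Both helpers follow the same extract-and-extend pattern, which can be exercised with a stub in place of document_extractor_tool. The stub below is purely illustrative; only the "extracted_data" key matches the real tool’s return shape:&lt;/p&gt;

```python
def collect_extractions(documents, extractor, query):
    """Run an extractor over each document and flatten the results into one list."""
    collected = []
    for doc in documents:
        result = extractor(doc, query)
        collected.extend(result["extracted_data"])
    return collected

def fake_extractor(doc, query):
    # Stand-in for document_extractor_tool: one dummy "entity" per document.
    return {"extracted_data": [{"text": doc, "class": "stub", "attributes": {}}]}
```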



&lt;p&gt;Next, build_graph_data(entities, relationships) first converts each entity into a graph node, assigning a unique ID, label, and type, while storing a mapping from entity text to node ID.&lt;/p&gt;

&lt;p&gt;It then processes relationships: for each relationship, it searches for mentioned entities and creates edges connecting them with the relationship type as the label. If no explicit relationships are found, it falls back to generating simple co-occurrence edges between all entities to ensure the graph is connected.&lt;/p&gt;

&lt;p&gt;Then the answer_query(entities, relationships, query) function lets you search the extracted data. It splits the query into words and finds entities whose text or attributes match any of those words, doing the same for relationships. It returns a dictionary containing the query, lists of relevant entities and relationships, and counts of each.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def build_graph_data(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]]) -&amp;gt; Dict[str, Any]:
    """Build graph data for visualization"""
    nodes = []
    edges = []

    # Create nodes from entities
    entity_map = {}
    for i, entity in enumerate(entities):
        node_id = str(i)
        nodes.append({
            "id": node_id,
            "label": entity["text"],
            "type": entity["class"]
        })
        entity_map[entity["text"].lower()] = node_id

    # Create edges from relationships and simple co-occurrence
    for rel in relationships:
        rel_text = rel["text"].lower()
        found_entities = []

        # Find entities mentioned in this relationship
        for entity_text, entity_id in entity_map.items():
            if entity_text in rel_text:
                found_entities.append(entity_id)

        # Create edges between found entities
        for i in range(len(found_entities)):
            for j in range(i + 1, len(found_entities)):
                edges.append({
                    "source": found_entities[i],
                    "target": found_entities[j],
                    "relation": rel["class"]
                })

    # If no relationships found, create simple co-occurrence edges
    if not edges:
        st.write("No relationship edges found, creating fallback edges...")
        for i, entity1 in enumerate(entities):
            for j, entity2 in enumerate(entities):
                if i &amp;lt; j:
                    # Create edges between all entities
                    edges.append({
                        "source": str(i),
                        "target": str(j),
                        "relation": "related_to"
                    })

    return {"nodes": nodes, "edges": edges}

def answer_query(entities: List[Dict[str, Any]], relationships: List[Dict[str, Any]], query: str) -&amp;gt; Dict[str, Any]:
    """Answer query using extracted entities and relationships"""
    if not query:
        return None

    # Find relevant entities
    relevant_entities = [
        e for e in entities 
        if any(word.lower() in e["text"].lower() or word.lower() in str(e["attributes"]).lower() 
               for word in query.split())
    ]

    # Find relevant relationships
    relevant_relationships = [
        r for r in relationships
        if any(word.lower() in r["text"].lower() or word.lower() in str(r["attributes"]).lower()
               for word in query.split())
    ]

    return {
        "query": query,
        "relevant_entities": relevant_entities,
        "relevant_relationships": relevant_relationships,
        "entity_count": len(relevant_entities),
        "relationship_count": len(relevant_relationships)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
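
&lt;p&gt;The fallback branch guarantees n*(n-1)/2 edges for n entities. A minimal re-run of just that co-occurrence step (equivalent to the nested loops above, written with explicit index ranges) shows the count:&lt;/p&gt;

```python
def cooccurrence_edges(entities: list) -> list:
    """Fallback edges between every pair of entities, as in build_graph_data."""
    edges = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):  # each unordered pair once
            edges.append({"source": str(i), "target": str(j), "relation": "related_to"})
    return edges
```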



&lt;p&gt;Then we create the process_documents function, the main pipeline that ties everything together. It takes a list of text documents and an optional query. First, it calls extract_entities and extract_relationships to pull structured financial entities and their connections from the documents, then writes debug info showing how many entities and relationships were found. Next, it passes these to build_graph_data to create nodes and edges for visualisation, and writes debug info about the graph size.&lt;/p&gt;

&lt;p&gt;Finally, if a query is provided, it calls answer_query to find relevant entities and relationships matching the query. The function returns a dictionary containing all extracted entities, relationships, the graph data, and any query results, giving a complete structured view of the documents and making it easy to visualise or analyse further.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def process_documents(documents: List[str], query: Optional[str] = None) -&amp;gt; Dict[str, Any]:
    """Process documents and optionally answer a query"""
    # Extract entities and relationships
    entities = extract_entities(documents)
    relationships = extract_relationships(documents)

    # Debug info
    st.write(f"Debug: Found {len(entities)} entities, {len(relationships)} relationships")

    # Build graph data
    graph_data = build_graph_data(entities, relationships)

    # Debug graph data
    st.write(f"Debug: Graph has {len(graph_data['nodes'])} nodes, {len(graph_data['edges'])} edges")

    # Answer query if provided
    results = answer_query(entities, relationships, query) if query else None

    return {
        "entities": entities,
        "relationships": relationships,
        "graph_data": graph_data,
        "results": results
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
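
&lt;p&gt;With the extractors stubbed out, the pipeline’s shape (entities, then relationships, then graph, then an optional answer) can be checked end to end. Everything in this sketch except the returned dict keys is a hypothetical stand-in for the real functions:&lt;/p&gt;

```python
def run_pipeline(documents, extract_ents, extract_rels, query=None):
    """Minimal mirror of process_documents, with injectable extract functions."""
    entities = extract_ents(documents)
    relationships = extract_rels(documents)
    graph = {
        "nodes": [{"id": str(i), "label": e["text"]} for i, e in enumerate(entities)],
        "edges": [],
    }
    results = {"query": query} if query else None
    return {"entities": entities, "relationships": relationships,
            "graph_data": graph, "results": results}
```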



&lt;p&gt;Finally, we set the page title and layout, then display a header. Next, the app loads the Gemini API key using load_gemini_key(); if no key is provided, it warns the user and stops execution. If a key is available, it sets it as an environment variable so the extractor functions can use it.&lt;/p&gt;

&lt;p&gt;The app uses a set of predefined documents about tech companies and displays a success message indicating how many documents will be processed. Users can optionally enter a query in a text input. When the “Process Documents” button is clicked, process_documents is called with the documents and the optional query. This returns entities, relationships, graph data, and query results.&lt;/p&gt;

&lt;p&gt;The results are displayed in four tabs: Graph Visualisation, Entities, Relationships, and Query Results. In the graph tab, format_output_agraph and display_agraph render an interactive knowledge graph. The entities and relationships tabs show extracted items with expandable JSON details for each. The query tab displays relevant results if a query was provided. Altogether, this function ties the full pipeline into an interactive, user-friendly Streamlit interface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Streamlit app
def main():
    st.set_page_config(page_title="GraphRAG with LangExtract", layout="wide")
    st.title("GraphRAG with LangExtract")

    # Load API key
    api_key, is_key_provided = load_gemini_key()

    if not is_key_provided:
        st.warning("Please provide an API key to continue")
        return

    # Set environment variable
    os.environ["GOOGLE_API_KEY"] = api_key

    # Predefined documents
    documents = [
        "Apple Inc. was founded by Steve Jobs and Steve Wozniak in 1976. The company is headquartered in Cupertino, California. Steve Jobs served as CEO until his death in 2011.",
        "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975. It's based in Redmond, Washington. Bill Gates was the CEO for many years.",
        "Both Apple and Microsoft are major technology companies that compete in various markets including operating systems and productivity software. They have a long history of rivalry.",
        "Google was founded by Larry Page and Sergey Brin in 1998. The company started as a search engine but has expanded into many areas including cloud computing and artificial intelligence."
    ]

    st.success(f"Using {len(documents)} predefined documents about tech companies")

    # Query input
    query = st.text_input("Enter your query (optional):")

    if st.button("Process Documents"):
        with st.spinner("Processing documents..."):
            result = process_documents(documents, query if query else None)

            # Display results in tabs
            tab1, tab2, tab3, tab4 = st.tabs(["Graph Visualization", "Entities", "Relationships", "Query Results"])

            with tab1:
                if result["graph_data"]:
                    st.subheader("Knowledge Graph")

                    nodes, edges = format_output_agraph(result["graph_data"])
                    if nodes:
                        display_agraph(nodes, edges)
                    else:
                        st.info("No graph data to display")

            with tab2:
                st.subheader("Extracted Entities")
                if result["entities"]:
                    for i, entity in enumerate(result["entities"]):
                        with st.expander(f"{entity['text']} ({entity['class']})"):
                            st.json(entity["attributes"])
                else:
                    st.info("No entities extracted")

            with tab3:
                st.subheader("Extracted Relationships")
                if result["relationships"]:
                    for i, rel in enumerate(result["relationships"]):
                        with st.expander(f"{rel['text']} ({rel['class']})"):
                            st.json(rel["attributes"])
                else:
                    st.info("No relationships extracted")

            with tab4:
                if query and result["results"]:
                    st.subheader("Query Results")
                    st.json(result["results"])
                else:
                    st.info("No query provided or no results")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;LangExtract alone cannot solve everything, but new AI tools need to keep being developed and released. Using various AI tools together highlights each tool’s weaknesses, which drives further improvement. AI has made remarkable progress in recent years, and behind that progress is feedback from many people. There is no failure in using AI, so it is worth trying these tools out first and building with AI ourselves.&lt;/p&gt;

&lt;p&gt;I would highly appreciate it if you could support my work:&lt;/p&gt;

&lt;p&gt;❣ Join my Patreon: &lt;a href="https://www.patreon.com/GaoDalie_AI" rel="noopener noreferrer"&gt;https://www.patreon.com/GaoDalie_AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Book an Appointment with me: &lt;a href="https://topmate.io/gaodalie_ai" rel="noopener noreferrer"&gt;https://topmate.io/gaodalie_ai&lt;/a&gt;&lt;br&gt;
Support the Content (every dollar goes back into the video): &lt;a href="https://buymeacoffee.com/gaodalie98d" rel="noopener noreferrer"&gt;https://buymeacoffee.com/gaodalie98d&lt;/a&gt;&lt;br&gt;
Subscribe to the Newsletter for free: &lt;a href="https://substack.com/@gaodalie" rel="noopener noreferrer"&gt;https://substack.com/@gaodalie&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
