<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Glen Yu</title>
    <description>The latest articles on DEV Community by Glen Yu (@glen_yu).</description>
    <link>https://dev.to/glen_yu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3340387%2Fe7f18162-e0df-40a1-9b5f-7f8fe56449f6.png</url>
      <title>DEV Community: Glen Yu</title>
      <link>https://dev.to/glen_yu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/glen_yu"/>
    <language>en</language>
    <item>
      <title>ML acceleration guide: TPUs vs GPUs</title>
      <dc:creator>Glen Yu</dc:creator>
      <pubDate>Tue, 28 Apr 2026 00:16:10 +0000</pubDate>
      <link>https://dev.to/gde/ml-acceleration-guide-tpus-vs-gpus-16oh</link>
      <guid>https://dev.to/gde/ml-acceleration-guide-tpus-vs-gpus-16oh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;There’s a lot of hype around GPUs and NVIDIA, but how much do you know about TPUs?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x0q0rg9actpj3f3bokn.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3x0q0rg9actpj3f3bokn.JPG" alt="Rack of TPUs at Google Next" width="800" height="1422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article includes code examples, which you can find near the end&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Rise of GPUs
&lt;/h2&gt;

&lt;p&gt;Graphics Processing Units (GPUs) have been around for quite some time. Their job is to render 2D and 3D graphics into millions of pixels, calculating colour, texture, and lighting in parallel before sending the result to your monitor. For a 60Hz monitor, that means producing a rendered frame 60 times every second.&lt;/p&gt;

&lt;p&gt;Rendering graphics is one thing, but writing general-purpose code for GPUs was considerably harder. That is, until NVIDIA launched CUDA (Compute Unified Device Architecture) in 2006, which allowed scientific researchers and developers in fields requiring massively parallel math to take advantage of a GPU’s capabilities. With the rise of machine learning in the early 2010s, that same massively parallel math turned out to be exactly what ML engineers needed to train deep neural networks. Since then, the focus of CUDA has shifted increasingly towards optimizing for machine learning and AI workloads.&lt;/p&gt;

&lt;p&gt;Because GPUs were commercially available and relatively inexpensive at the time, the barrier to entry was low. An ML engineer could train models on their NVIDIA graphics card during the day and jump into a game of League of Legends at night on the same hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Honourable mention
&lt;/h3&gt;

&lt;p&gt;AMD’s GPUs are backed by Radeon Open Compute (ROCm), an open-source software stack designed to compete in the AI ecosystem. Though it’s not as popular as CUDA, the gap is closing, with &lt;a href="https://www.amd.com/en/newsroom/press-releases/2026-2-24-amd-and-meta-announce-expanded-strategic-partnersh.html" rel="noopener noreferrer"&gt;Meta recently signing a deal to expand its existing partnership with AMD&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tensor Processing Unit
&lt;/h2&gt;

&lt;p&gt;In the early 2010s, Google projected that the growing demands of its AI workloads, particularly the rapid adoption of deep learning across products like Search and Photos, would require doubling its data center computing capacity roughly every year and a half. Rather than scale generic hardware indefinitely, Google sought a more efficient solution purpose-built for neural network computation, and thus the Tensor Processing Unit (TPU) was born. The TPU is a custom application-specific integrated circuit (ASIC) designed by Google specifically to accelerate AI workloads, deployed internally starting in 2015. By specializing the hardware for the dense matrix operations at the heart of neural networks, TPUs achieve dramatically better performance per watt than general-purpose CPUs or GPUs, reducing both energy consumption and cooling demands at data center scale.&lt;/p&gt;

&lt;p&gt;Google has a tradition of making tools it uses internally available to the broader world, and TPUs are another example of this. The existence of TPUs was first publicly announced at Google I/O in 2016. In 2018, Cloud TPU v2 became available to external users through Google Cloud, marking the first time developers outside Google could harness the same accelerators powering Google’s own AI systems. TPUs also come in two flavours, &lt;em&gt;efficiency&lt;/em&gt; and &lt;em&gt;performance&lt;/em&gt;, to meet different market needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: As of the 8th generation of TPUs announced during Google Next 2026, &lt;em&gt;efficiency&lt;/em&gt; and &lt;em&gt;performance&lt;/em&gt; TPUs will be renamed &lt;em&gt;&lt;strong&gt;inference&lt;/strong&gt;&lt;/em&gt; and &lt;em&gt;&lt;strong&gt;training&lt;/strong&gt;&lt;/em&gt; respectively in favour of a more descriptive, workload-based naming convention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture layout
&lt;/h2&gt;

&lt;p&gt;From an architectural standpoint, GPUs can be thought of as individual computers with accelerators (picture your home gaming PC). If you want to connect them into a cluster, it has to happen over the network, and no matter how fast that network is, traffic still has to cross node boundaries, so bandwidth drops as a result.&lt;/p&gt;

&lt;p&gt;TPUs are designed from the ground up to be interconnected at massive scale, with a physical layout that arranges thousands of TPU chips in a torus topology, giving every chip six neighbours (two per axis, one on each side). Recognizing that interconnect bandwidth would be the main bottleneck at this scale, Google designed its own proprietary Inter-Chip Interconnect (ICI) network, which provides uniform, high-bandwidth, low-latency connections between all the chips in a slice regardless of physical location. With a torus topology, there is no concept of crossing a node boundary. When you request TPUs, you do not get the entire TPU cluster or pod; rather, you get only a small subset, or slice. To make this possible, Google developed the Optical Circuit Switch (OCS), which can rewire physical connections on the fly (entirely in software), allowing the same hardware to serve different workload shapes without any physical reconfiguration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: Efficiency TPU versions use a 2D torus topology, while Performance TPUs leverage a 3D torus architecture to give you maximum performance with minimum latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Precision and range
&lt;/h2&gt;

&lt;p&gt;A floating-point number consists of three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sign&lt;/strong&gt;: Positive or negative (represented by the first bit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exponent&lt;/strong&gt;: Determines the range of the number&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mantissa&lt;/strong&gt;: Significant digits of a floating-point number, which determines the accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditionally, the standard for high-performance computing was FP32. When AI researchers moved to FP16 to save memory, they lost more than just accuracy: they also lost range. FP32 uses 8 bits for the exponent, while FP16 uses only 5. That 3-bit difference in the exponent amounts to an almost 10³⁴ difference in range (FP32 has a range of 3.4 x 10³⁸, while FP16 only has a range of 6.5 x 10⁴). In deep learning, where gradients can be incredibly tiny, FP16 often suffers from underflow (numbers get rounded to 0 because they are too small for FP16’s range to represent), requiring a workaround called “&lt;a href="https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#lossscaling" rel="noopener noreferrer"&gt;loss scaling&lt;/a&gt;” to keep the math stable.&lt;/p&gt;
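&lt;p&gt;As a quick, hand-rolled illustration (a NumPy toy sketch of mine, not part of the notebooks later in this article), you can watch a gradient-sized value underflow in FP16 while FP32 keeps it, and see how loss scaling recovers it:&lt;/p&gt;

```python
import numpy as np

# A gradient-sized value: comfortably within FP32's range,
# but below FP16's smallest subnormal (about 6e-8)
tiny_grad = 1e-8

fp32 = np.float32(tiny_grad)   # survives
fp16 = np.float16(tiny_grad)   # underflows to 0.0
print(fp32, fp16)

# Loss scaling works around this: multiply the loss up before the
# backward pass so gradients stay representable, then unscale in FP32.
scale = 1024.0
scaled = np.float16(tiny_grad * scale)      # now representable
recovered = np.float32(scaled) / scale      # approximately 1e-8 again
print(recovered)
```

&lt;p&gt;In real mixed-precision training, the scale factor is chosen (often dynamically) so that gradients neither underflow nor overflow.&lt;/p&gt;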

&lt;p&gt;Google Brain (now part of Google DeepMind) solved this by inventing Brain Floating Point (&lt;em&gt;&lt;strong&gt;bfloat16&lt;/strong&gt;&lt;/em&gt;), which simply shifts 3 bits from the mantissa to the exponent:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;Total Bits&lt;/th&gt;
&lt;th&gt;Exponent Bits&lt;/th&gt;
&lt;th&gt;Mantissa Bits&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FP32&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FP16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;bfloat16&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;By sacrificing precision for range, bfloat16 offers the same massive range as FP32, but with the reduced memory and bandwidth of FP16. A big reason this works is that deep learning models are surprisingly noise-tolerant: training stability is far more important than a few extra decimal places of precision. Today, bfloat16 is the de facto standard for training modern LLMs on NVIDIA’s GPUs and Google’s TPUs.&lt;/p&gt;
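&lt;p&gt;Because bfloat16 is simply the top 16 bits of an FP32 value, you can emulate its rounding with nothing but the standard library by zeroing the low 16 bits of the FP32 bit pattern. This toy sketch of mine truncates rather than rounding to nearest (a simplification of real hardware), but it shows the range advantage over FP16:&lt;/p&gt;

```python
import struct

def to_bfloat16(x):
    """Truncate a Python float to a bfloat16-representable value by
    keeping only the top 16 bits of its FP32 bit pattern."""
    bits = struct.unpack("=I", struct.pack("=f", x))[0]  # FP32 bits as uint32
    bits = bits - (bits % 65536)                         # zero the low 16 bits
    return struct.unpack("=f", struct.pack("=I", bits))[0]

# 1e-30 is far outside FP16's range (it would underflow to 0),
# but bfloat16 keeps it, losing only mantissa precision.
value = 1e-30
bf = to_bfloat16(value)
print(bf)
```

&lt;p&gt;Here 1e-30 would underflow straight to 0 in FP16, yet the bfloat16 neighbour stays within about 1% of the original value despite having only 7 mantissa bits.&lt;/p&gt;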

&lt;h2&gt;
  
  
  Why XLA matters
&lt;/h2&gt;

&lt;p&gt;Standard Python execution typically takes an &lt;em&gt;eager&lt;/em&gt; approach. This means it executes each step as it is being encountered. This is great for debugging because you can insert print statements to inspect variables at any point.&lt;/p&gt;

&lt;p&gt;XLA (Accelerated Linear Algebra), on the other hand, is a domain-specific JIT compiler. Instead of executing steps one by one, it analyzes the entire execution graph to optimize and fuse operations before they run. This &lt;em&gt;lazy&lt;/em&gt; approach creates an initial warm-up delay, but once the training starts, it is significantly faster than standard methods. The tradeoff is transparency: your step-by-step Python code becomes an optimized “black box”, making traditional debugging strategies more difficult. This is why TPUs are powerhouses for massive enterprise training, while GPUs remain the flexible choice for quick experimentation.&lt;/p&gt;
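&lt;p&gt;To make the eager-versus-lazy distinction concrete, here is a deliberately tiny, pure-Python toy of my own (not real XLA): instead of executing each operation immediately, the "compiler" records the graph first, fuses a multiply followed by an add into a single fused multiply-add, and only then executes:&lt;/p&gt;

```python
# Toy illustration of operator fusion (not real XLA).
# Eager mode runs each op as it is encountered; the "compiler" below
# records ops first, fuses mul+add pairs, then executes fewer steps.

def eager(x, ops):
    for name, arg in ops:
        if name == "mul":
            x = x * arg
        elif name == "add":
            x = x + arg
    return x

def compile_and_run(x, ops):
    fused, i = [], 0
    while i != len(ops):
        # Fuse a mul immediately followed by an add into one FMA-style op
        if ops[i][0] == "mul" and i + 1 != len(ops) and ops[i + 1][0] == "add":
            fused.append(("fma", (ops[i][1], ops[i + 1][1])))
            i += 2
        else:
            fused.append(ops[i])
            i += 1
    for name, arg in fused:
        if name == "fma":
            m, a = arg
            x = x * m + a   # one fused step instead of two
        elif name == "mul":
            x = x * arg
        elif name == "add":
            x = x + arg
    return x

ops = [("mul", 2.0), ("add", 3.0), ("mul", 4.0)]
assert eager(1.0, ops) == compile_and_run(1.0, ops)  # same math, fewer steps
```

&lt;p&gt;The one-time pass over the graph is the analogue of XLA's warm-up compile; the payoff is fewer, larger operations at execution time.&lt;/p&gt;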

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: Though XLA was built for TPUs, it has also made its way into the NVIDIA GPU ecosystem via tools such as JAX and torch.compile (since PyTorch 2.0).&lt;/p&gt;

&lt;h3&gt;
  
  
  TorchTPU
&lt;/h3&gt;

&lt;p&gt;Google is engineering a &lt;a href="https://developers.googleblog.com/torchtpu-running-pytorch-natively-on-tpus-at-google-scale/" rel="noopener noreferrer"&gt;TorchTPU&lt;/a&gt; stack that will provide native PyTorch support. This would allow you to run models on TPUs as-is, with full support for native PyTorch features. TorchTPU is currently in preview, and once it becomes GA, you can be sure I’ll be diving deeper into it!&lt;/p&gt;

&lt;h2&gt;
  
  
  Code example
&lt;/h2&gt;

&lt;p&gt;I’m including a couple of Jupyter notebooks that I ran via &lt;a href="https://medium.com/google-cloud/leveraging-tpus-in-colab-featuring-antigravity-c312ad12c1b6" rel="noopener noreferrer"&gt;Antigravity + Colab plugin&lt;/a&gt; for you to try yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://storage.googleapis.com/public-file-server/aiml/mnist_w_gpu_cuda.ipynb" rel="noopener noreferrer"&gt;Fashion MNIST with GPU and CUDA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://storage.googleapis.com/public-file-server/aiml/mnist_w_tpu_xla.ipynb" rel="noopener noreferrer"&gt;Fashion MNIST with TPU and XLA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you will see from the results below, TPU is indeed faster. However, my example isn’t large enough or complex enough to really showcase the true speeds that TPU can bring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: I have a Colab Pro account, which affords me access to additional GPUs and TPUs. The Colab free tier only includes the NVIDIA T4 and TPU v5e-1.&lt;/p&gt;

&lt;h3&gt;
  
  
  Interpreting training results
&lt;/h3&gt;

&lt;p&gt;These are some benchmark training runs (epochs: 50, batch size: 512) comparing an NVIDIA T4 GPU with (default) FP32 against a Google TPU v5e-1 (single-chip TPU) with bfloat16. As expected, the TPU was faster but with lower precision:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ssyverv2op61zythqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76ssyverv2op61zythqn.png" alt="T4 GPU (FP32), epochs: 50, batch size: 512" width="800" height="996"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzamdd219pa47g3sjwhm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzamdd219pa47g3sjwhm4.png" alt="TPU v5e-1 (bfloat16), epochs: 50, batch size: 512" width="800" height="994"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then trained the same model on the T4 GPU using bfloat16 but noticed a massive performance drop. This was because the T4 is an older-generation GPU that does not support bfloat16 natively and had to emulate it, which added a lot of overhead. Switching to a newer L4 GPU, I was able to see the (tiny) performance gain along with the reduced precision:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3c8d45visujw7eupu75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3c8d45visujw7eupu75.png" alt="T4 GPU (bfloat16), epochs: 50, batch size: 512" width="800" height="1002"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdny1eu3il9ogeyral3v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdny1eu3il9ogeyral3v6.png" alt="L4 GPU (bfloat16), epochs: 50, batch size: 512" width="640" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, I thought I’d see how the training would perform on a newer TPU v6e-1 and I was blown away by the improvement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31c4dykl3knydhg73mmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31c4dykl3knydhg73mmi.png" alt="TPU v6e-1 (bfloat16), epochs: 50, batch size: 512" width="640" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Comparing GPUs and TPUs isn’t exactly apples-to-apples. They represent fundamentally different philosophies in architecture, memory management, and execution.&lt;/p&gt;

&lt;p&gt;In the modern enterprise, it isn’t usually a matter of choosing one over the other, but rather using each where it shines. For rapid iteration and smaller workloads, the flexibility of GPUs is unmatched. However, once a project hits a certain scale, the domain-specific architecture of the TPU becomes the clear winner in efficiency and throughput.&lt;/p&gt;

&lt;p&gt;TPUs are as fast as they are because they are a specialized one-trick pony, but to truly harness that power requires a deeper understanding of the stack. The biggest challenge often isn’t the compute itself, but the input pipeline: “How do I feed data to the TPUs fast enough and efficiently enough that the hardware doesn’t sit idle?”&lt;/p&gt;
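&lt;p&gt;The classic mitigation is to overlap input processing with compute so the accelerator never waits, e.g. by prefetching batches on a background thread. A minimal, standard-library-only sketch of the idea (illustrative only; in practice you would reach for something like tf.data's prefetch):&lt;/p&gt;

```python
import queue
import threading
import time

def producer(batches, buf):
    """Simulate a data pipeline: prepare batches ahead of time."""
    for batch in batches:
        time.sleep(0.01)      # pretend this is disk I/O plus preprocessing
        buf.put(batch)        # blocks when the prefetch buffer is full
    buf.put(None)             # sentinel: no more data

def train(buf):
    """Simulate the accelerator: consume batches as they arrive."""
    results = []
    while True:
        batch = buf.get()
        if batch is None:
            return results
        results.append(sum(batch))  # pretend this is a training step

buf = queue.Queue(maxsize=2)        # prefetch depth of 2
batches = [[1, 2], [3, 4], [5, 6]]
t = threading.Thread(target=producer, args=(batches, buf))
t.start()
results = train(buf)
t.join()
print(results)  # [3, 7, 11]
```

&lt;p&gt;While the "accelerator" works on one batch, the producer is already preparing the next, which is the same overlap a well-tuned TPU input pipeline aims for.&lt;/p&gt;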

&lt;p&gt;In future posts, I'll dive deeper into these advanced concepts to show how you can optimize data pipelines and get the most out of your TPUs.&lt;/p&gt;

&lt;h3&gt;
  
  
  BONUS: Google’s 8th-generation TPUs announced at Google Next
&lt;/h3&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/3Qw_CZkiQQg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>tpu</category>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>tpusprint</category>
    </item>
    <item>
      <title>Implementing a RAG system: Run</title>
      <dc:creator>Glen Yu</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:00:00 +0000</pubDate>
      <link>https://dev.to/gde/implementing-a-rag-system-run-148g</link>
      <guid>https://dev.to/gde/implementing-a-rag-system-run-148g</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;In the "Crawl" and "Walk" phases, I introduced the basics of RAG and explored ways to optimize the pipeline to increase efficiency and accuracy. Armed with this knowledge, it's time to productionize our learnings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Run
&lt;/h2&gt;

&lt;p&gt;In the "&lt;a href="https://dev.to/gde/implementing-a-rag-system-crawl-5li"&gt;Crawl&lt;/a&gt;" and "&lt;a href="https://dev.to/gde/implementing-a-rag-system-walk-4h76"&gt;Walk&lt;/a&gt;" phases, we explored RAG fundamentals using local tools, proving how much document processing and re-ranking impact performance. While you could certainly scale those manual workflows into production, do you really wan to manage the infrastructure, data pipelines and scaling hurdles yourself?&lt;/p&gt;

&lt;p&gt;Welcome to the "Run" phase. Here we leverage Google Cloud's Vertex AI RAG Engine - a fully managed solution that automates the entire pipeline so you can focus on building, not maintenance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn99a73jjwmkfpn4uzxu5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn99a73jjwmkfpn4uzxu5.png" alt="" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vertex AI RAG Engine
&lt;/h2&gt;

&lt;p&gt;Vertex AI RAG Engine is a low-code, fully managed solution for building AI applications on private data. It handles the ingestion, document processing, embedding, retrieval, ranking, and grounding to ensure that the response is highly accurate and relevant.&lt;/p&gt;
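&lt;p&gt;Under the hood, the retrieval step of any RAG system comes down to embedding the query and ranking chunks by vector similarity. This dependency-free toy of mine (with made-up three-dimensional "embeddings") sketches the core idea that RAG Engine automates with real embedding models and a managed vector store:&lt;/p&gt;

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Made-up embeddings; a real system uses a hosted embedding model
chunks = {
    "vacation policy: 15 days paid leave": [0.9, 0.1, 0.0],
    "expense reports are due monthly":     [0.1, 0.8, 0.2],
    "remote work requires approval":       [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # pretend embedding of "how much vacation?"

# Retrieve: rank chunks by similarity to the query, best match first
ranked = sorted(chunks, key=lambda c: cosine(chunks[c], query_embedding), reverse=True)
print(ranked[0])  # the vacation chunk ranks first
```

&lt;p&gt;RAG Engine layers ingestion, chunking, re-ranking, and grounding on top of this retrieval core so you never have to hand-roll it.&lt;/p&gt;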

&lt;h3&gt;
  
  
  Optional document pre-processing with Docling
&lt;/h3&gt;

&lt;p&gt;Though RAG Engine comes with its own parser options, I still opted to pre-process my documents using Docling first and upload them to a Google Cloud Storage bucket for ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;DATA_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://storage.googleapis.com/public-file-server/genai-downloads/bc_hr_policies.tgz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;DATA_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;download_and_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATA_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;docling_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docling_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;docling_docs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt; Processing PDFs...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;data_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATA_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;pdf_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pdf_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;markdown_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;export_to_markdown&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="n"&gt;docling_doc_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;stem&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.md&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docling_docs/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;docling_doc_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;markdown_content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error on &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt; Uploading Docling docs to GCS...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;upload_folder_to_gcs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./docling_docs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GCS_BUCKET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GCS_BUCKET_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One benefit of doing it this way is that I get more visibility and control over the document processing step and can validate the contents of the original PDFs against the Docling documents (Markdown). This allows me to use the default parsing libraries option, which is also free, unlike the LLM parser and Document AI layout parser options, which have an additional cost and setup component to them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9li85pcda4g51p69ne0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9li85pcda4g51p69ne0l.png" alt="Chunking strategy" width="800" height="829"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I do lose out on the benefits of the hybrid chunking strategy that Docling would have provided (as seen in the "Walk" phase), because that is determined by the layout parser that I choose here. If I wasn't using Docling, I think the &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/llm-parser" rel="noopener noreferrer"&gt;LLM parser&lt;/a&gt; would be the parser option that I'd be gravitating towards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector database options
&lt;/h3&gt;

&lt;p&gt;When it comes to vector database options, you'll see several choices in the RAG Engine menu (including regional "Preview" features). I chose the &lt;em&gt;RagManaged Cloud Spanner&lt;/em&gt; (also referred to as the &lt;code&gt;RagManagedDb&lt;/code&gt;) because it offers the fastest path from data to insights with the least amount of infrastructure management. While Spanner is typically an enterprise-grade database, the RAG Engine allows you to spin it up on the &lt;strong&gt;&lt;em&gt;Basic tier&lt;/em&gt;&lt;/strong&gt;. This allocates 100 processing units, which is 10% of a Spanner node, making it perfect for smaller datasets while still giving you the reliability of a managed service without the enterprise-grade cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwisiqehtbk03nk4mhpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwisiqehtbk03nk4mhpw.png" alt="Spanner basic tier" width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IMPORTANT&lt;/strong&gt;: Even on basic tier, this will still run you about $65 USD/month, so please remember to delete and clean up this RAG corpus once you're done experimenting with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs8lrssvzgfv06uugjv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs8lrssvzgfv06uugjv0.png" alt="Embedding model and vector DB" width="800" height="693"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For those prioritizing flexibility, the RAG Engine also supports third-party options like &lt;a href="https://www.pinecone.io" rel="noopener noreferrer"&gt;Pinecone&lt;/a&gt; and &lt;a href="https://weaviate.io" rel="noopener noreferrer"&gt;Weaviate&lt;/a&gt;. These are excellent choices if portability is a requirement, allowing you to maintain a consistent vector store even if you decide to shift parts of your RAG stack to a different cloud provider or platform later on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ranking &amp;amp; grounding included
&lt;/h3&gt;

&lt;p&gt;Once the RAG corpus is created, you can perform some manual testing to validate it. When you ask RAG Engine for search results, re-ranking and grounding are done automatically to ensure relevance and correctness:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01et2bfhih9z7i72x9i8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F01et2bfhih9z7i72x9i8.png" alt="RAG Engine corpus test" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Armor
&lt;/h2&gt;

&lt;p&gt;In a production setting (especially if it's going to be public facing), you will want guardrails. I've written about guardrails with the Agent Development Kit in the past; they are implemented through callbacks within ADK. It works the same way here and can be used to inspect text as it flows into and out of the LLM/agent. Key capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection &amp;amp; jailbreak detection&lt;/strong&gt;: Detects attempts to trick the AI into ignoring its instructions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive Data Protection&lt;/strong&gt;: Natively integrates with Google's Data Loss Prevention (DLP) to scan for various types of sensitive information (PII)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Malicious URL detection&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsible AI (RAI) filters&lt;/strong&gt;: Hate speech, harassment, dangerous, and sexually explicit content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I configured a Model Armor policy template, which I then invoked via the Python SDK to determine whether the text was safe.&lt;/p&gt;
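The decision logic around that safety check can be sketched in plain Python. This is a minimal illustration only: `is_prompt_safe` is a hypothetical stand-in for the actual Model Armor SDK call against a policy template, not the real API.

```python
# Minimal sketch of guardrail flow; is_prompt_safe() is a hypothetical
# stand-in for a real Model Armor policy-template check.
BLOCKED_MESSAGE = "Sorry, I can't help with that request."

def is_prompt_safe(text: str) -> bool:
    # A real implementation would call the Model Armor API here and
    # return whether the policy template flagged the text.
    blocked_phrases = ["ignore your instructions", "reveal your system prompt"]
    lowered = text.lower()
    return not any(phrase in lowered for phrase in blocked_phrases)

def guard_prompt(user_prompt: str):
    """Return a canned refusal if the prompt is unsafe; None lets it proceed."""
    if not is_prompt_safe(user_prompt):
        return BLOCKED_MESSAGE
    return None
```

In ADK terms, this is the decision a before-model callback makes: returning a response short-circuits the model call, while returning None lets the request continue on to the LLM.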

&lt;h2&gt;
  
  
  Updated example
&lt;/h2&gt;

&lt;p&gt;You can find the code for the "Run" phase → &lt;a href="https://github.com/Neutrollized/rag-systems-crawl-walk-run/tree/main/03_run" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Querying the vector database (RAG Engine in this case) is a lot less involved, as I don't have to write much of the logic to pass the semantic search results to a re-ranker; RAG Engine takes care of all of that for me!&lt;/p&gt;

&lt;p&gt;I once again ran the same two benchmark questions as I did in the "Crawl" and "Walk" phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioy5toiaa3phxqp2er17.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fioy5toiaa3phxqp2er17.png" alt="HR RAG ADK Agent w/Gemini 3.1 Pro Preview + RAG Engine" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I threw in a couple of extra questions to make sure Model Armor wasn't sleeping on the job, but overall I liked the detail and accuracy of the answers I was provided.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnge73hhylqdstzcuxges.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnge73hhylqdstzcuxges.png" alt="HR RAG ADK Agent w/Gemini 3.1 Pro Preview + RAG Engine + Model Armor" width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: In my updated example, I only added a &lt;code&gt;before_model_callback&lt;/code&gt;, meaning I'm only checking the input prompts and not the responses. An &lt;code&gt;after_model_callback&lt;/code&gt; should also be implemented so the generated response is scanned too, preventing the AI from accidentally leaking sensitive internal data it might have retrieved from the RAG corpus (I omitted the output check here simply because I know there's no sensitive data in this particular dataset).&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The purpose of this "Crawl, Walk, Run" series was to take you on a journey from managing code to delivering value. In the earlier phases, we deconstructed the mechanics of how RAG works to understand the roles that chunking, embedding, and re-ranking play in the overall system. In this final phase, we see how Vertex AI RAG Engine and Model Armor streamline those manual components. By offloading infrastructure management and safety logic to Google Cloud's managed services, you can ensure your system is scalable, accurate, and secure from day one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;p&gt;Currently in private preview is &lt;a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/use-rag-managed-vertex-ai-vector-search" rel="noopener noreferrer"&gt;Vector Search 2.0 with RAG&lt;/a&gt;; reading through its documentation and features, it looks pretty interesting, so once it becomes GA, I will definitely give it a try.&lt;br&gt;
I'm also looking forward to all the new AI-related announcements that are sure to happen at &lt;a href="https://www.googlecloudevents.com/next-vegas" rel="noopener noreferrer"&gt;Google Next&lt;/a&gt;!&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional learning
&lt;/h3&gt;

&lt;p&gt;Interested in finding out more about how to secure your agent by sanitizing input and output? Try out this &lt;a href="https://codelabs.developers.google.com/secure-agent-modelarmor#0?utm_campaign=CDR_0xe7f5807a_default_b479282946&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Model Armor&lt;/a&gt; Codelab! &lt;br&gt;
Vertex AI RAG Engine isn't your only option for managed RAG; if you'd like to try a different option that uses Vertex AI Search, might I suggest the &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-with-gc/7-advanced-agent-capabilities/building-agents-with-retrieval-augmented-generation#0?utm_campaign=CDR_0xe7f5807a_default_b479282946&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Building Agents with Retrieval-Augmented Generation&lt;/a&gt; Codelab?&lt;/p&gt;

&lt;p&gt;Happy learning!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ragengine</category>
      <category>adk</category>
      <category>modelarmor</category>
    </item>
    <item>
      <title>Implementing a RAG system: Walk</title>
      <dc:creator>Glen Yu</dc:creator>
      <pubDate>Tue, 31 Mar 2026 17:00:00 +0000</pubDate>
      <link>https://dev.to/gde/implementing-a-rag-system-walk-4h76</link>
      <guid>https://dev.to/gde/implementing-a-rag-system-walk-4h76</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Now that we've established the basics in our "Crawl" phase, it's time to pick up the pace. In this guide, we'll move beyond the initial setup to focus on optimizing core architectural components for better performance and accuracy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Walk
&lt;/h2&gt;

&lt;p&gt;We ended the previous "&lt;a href="https://dev.to/gde/implementing-a-rag-system-crawl-5li"&gt;Crawl&lt;/a&gt;" design with a functioning AI HR agent with a RAG system. The responses, however, could be better. I've introduced some new elements to the architecture to perform better document processing and chunking, as well as a re-ranker model to sort the semantic retrieval results by relevance:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnucwuvtoz1yd2fcj0uz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqnucwuvtoz1yd2fcj0uz.png" alt="" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The ugly Docling
&lt;/h2&gt;

&lt;p&gt;IBM's Docling is an open-source document processing tool and easily one of the most effective ones I've tested. It can convert various file formats (e.g., PDF, DOCX, HTML) into clean, structured formats like Markdown and JSON. By integrating AI models and OCR, it doesn't just extract text; it also preserves the original layout's integrity.&lt;br&gt;
Through its hierarchical and hybrid chunking methods, Docling intelligently groups content by heading, merges smaller fragments for better context, and attaches rich metadata to streamline downstream searching and citations. Here's a Python function I use for chunking a PDF file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;docling_chunk_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DocumentConverter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;format_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;InputFormat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PDF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;PdfFormatOption&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;pipeline_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pipeline_options&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;
    &lt;span class="n"&gt;chunker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HybridChunker&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;chunk_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_texts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I plan to take a deeper dive into Docling in a future article, so give me a follow so you won't miss it! 😄&lt;/p&gt;

&lt;h2&gt;
  
  
  Dot product vs cosine similarity
&lt;/h2&gt;

&lt;p&gt;In the "Crawl" post, I talked briefly about cosine similarity and how it ignore magnitude and only focuses on the angle between two vectors. This is because normalization is baked into the cosine similarity formula.&lt;br&gt;
Dot product is effectively cosine similarity but without the final normalization step, which is why its result is affected by the magnitude of the vectors. Since many modern embedding models output pre-normalized unit vectors, the extra normalization step in cosine similarity becomes a redundant calculation. By using dot product on these pre-normalized vectors, you can achieve identical results with higher computational efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE #1:&lt;/strong&gt; While switching to dot product can increase your raw retrieval throughput, the latency gains may feel negligible when considering the entire end-to-end RAG pipeline, depending on your particular use case and scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE #2:&lt;/strong&gt; A friendly reminder that choosing dot product over cosine similarity has the hard requirement that your vectors be normalized beforehand, or the magnitude will skew your search results. It's also quite easy to update your search configuration to use one or the other. If you're ever in doubt, just run a quick test with both settings to verify that both methods return the exact same nearest neighbours (top semantic matches).&lt;/p&gt;
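That quick test can be done in a few lines of plain Python (nothing beyond the standard `math` module is assumed):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes (the normalization step)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Pre-normalize the vectors, as many modern embedding models already do
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# On unit vectors, the plain dot product and cosine similarity agree
print(math.isclose(dot(a, b), cosine_similarity(a, b)))  # True
```

If the two methods disagree on your real vectors, that's a sign they weren't pre-normalized after all.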
&lt;h2&gt;
  
  
  Re-ranking
&lt;/h2&gt;

&lt;p&gt;Standard search is built for speed, not deep understanding, so it can sometimes miss nuance. Re-ranking takes a crucial "second look" at the standard retrieval results to determine which one(s) actually address the user's query. While the cosine distance represents how closely the query and document align in the vector space, a "close" match doesn't guarantee an answer. The re-ranker's job is to bridge this gap by scrutinizing the top results to assign a true &lt;code&gt;relevance_score&lt;/code&gt; and ensure the most helpful contexts rise to the top. Here's what that snippet of code looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohere&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientV2&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;RERANKING_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;documents_to_rerank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_responses&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;reranked_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;original_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;candidate_responses&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;reranked_results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;original_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;original_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;original_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;original_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;original_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_distance&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevance_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevance_score&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As part of the full code that performs the re-ranking, I assign a threshold for the relevance score. Scores lower than this threshold are deemed irrelevant.&lt;/p&gt;
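The filtering step itself is a one-liner; the threshold value and the sample results below are purely illustrative:

```python
# Filter out re-ranked results below a relevance threshold.
# RELEVANCE_THRESHOLD is an illustrative value; tune it for your corpus.
RELEVANCE_THRESHOLD = 0.5

reranked_results = [
    {"content": "Vacation policy details...", "relevance_score": 0.91},
    {"content": "Cafeteria menu...", "relevance_score": 0.12},
]

# Keep only the results the re-ranker scored as relevant enough
relevant_results = [
    r for r in reranked_results
    if r["relevance_score"] >= RELEVANCE_THRESHOLD
]

print(len(relevant_results))  # 1
```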

&lt;h2&gt;
  
  
  Updated example
&lt;/h2&gt;

&lt;p&gt;I'm shaking up the stack for the "Walk" phase! In addition to using a different document processor, I will also be using a different embedding model and vector database.&lt;br&gt;
Since I wanted to try out Cohere's re-ranking model, I opted to lean into their full suite and use their embedding model as well. I made a deliberate choice here to set the embedding dimension to &lt;code&gt;384&lt;/code&gt;, which is lower than the &lt;code&gt;768&lt;/code&gt; I previously used in the "Crawl" example. I wanted to handicap the initial semantic search so that we can more clearly see the re-ranker work its magic to fix the order of the results.&lt;br&gt;
I switched from ChromaDB to &lt;a href="https://lancedb.com" rel="noopener noreferrer"&gt;LanceDB&lt;/a&gt; to showcase just how many robust, easy-to-use open-source local vector databases are available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying the HR agent
&lt;/h3&gt;

&lt;p&gt;While I kept the core agent configuration from the "Crawl" phase the same, the addition of the re-ranking step made a significant impact. I asked the same two benchmark questions and this time the results were more refined and accurate:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jij6wvu4r582l02ww6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jij6wvu4r582l02ww6z.png" alt="HR RAG + re-ranking ADK Agent w/Gemini 3.1 Pro Preview" width="800" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find the code for the "Walk" phase → &lt;a href="https://github.com/Neutrollized/rag-systems-crawl-walk-run/tree/main/02_walk" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Now that we've manually optimized our retrieval and re-ranking, the next step is to scale. I will be migrating this architecture to Vertex AI's RAG Engine for a fully managed, high-performance RAG pipeline at an enterprise scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional learning
&lt;/h3&gt;

&lt;p&gt;I used Cohere's embedding and re-ranking models in my example, but if you want to try out Vertex AI's re-ranking capabilities (and more), try out this &lt;a href="https://codelabs.developers.google.com/codelabs/production-ready-ai-with-gc/8-advanced-rag-methods/advanced-rag-methods#0?utm_campaign=CDR_0xe7f5807a_default_b479282946&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;Advanced RAG Techniques&lt;/a&gt; Codelab.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>genai</category>
      <category>opensource</category>
      <category>adk</category>
    </item>
    <item>
      <title>Implementing a RAG system: Crawl</title>
      <dc:creator>Glen Yu</dc:creator>
      <pubDate>Tue, 24 Mar 2026 18:00:00 +0000</pubDate>
      <link>https://dev.to/gde/implementing-a-rag-system-crawl-5li</link>
      <guid>https://dev.to/gde/implementing-a-rag-system-crawl-5li</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I'm starting a "Crawl, Walk, Run" series of posts on various topics and decided to start with Retrieval-Augmented Generation (RAG). Learn the basics and progress to a production-ready system!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Crawl
&lt;/h2&gt;

&lt;p&gt;In this phase of your journey, we're going to learn about the core concepts of a Retrieval-Augmented Generation (RAG) system and then apply them in a simple example.&lt;br&gt;
We're going to build a Human Resources (HR) agent that can help answer and navigate HR-related questions. Using the &lt;a href="https://www2.gov.bc.ca/gov/content/careers-myhr/managers-supervisors/employee-labour-relations/conditions-agreements/policy/hr-policy-pdf" rel="noopener noreferrer"&gt;Government of British Columbia's HR Policy PDFs&lt;/a&gt; as our knowledge base, we will process, chunk, and embed the documents into a local vector database. This allows the agent to provide grounded answers and ensures that every response is rooted directly in the ingested BC government policies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoafc5mkjuzhbn4xr9p6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgoafc5mkjuzhbn4xr9p6.png" alt="" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Why RAG?
&lt;/h3&gt;

&lt;p&gt;RAG is a very common design pattern that turns a standard LLM into an informed AI agent. Standard models can be a "black box", but RAG gives your agent an "open-book test". It bypasses knowledge cutoffs by linking directly to your documents, providing factual grounding and citations. No fine-tuning is required, and data can be updated quickly. RAG provides a real-time bridge between your LLM and your data.&lt;br&gt;
This post will focus primarily on indexing and retrieval of your data. Let's get started!&lt;/p&gt;
&lt;h2&gt;
  
  
  How do you eat an elephant? One bite at a time
&lt;/h2&gt;

&lt;p&gt;It's not feasible to feed all the information into the AI every time you want to ask it a question. Instead, it is broken down into smaller, more manageable pieces called chunks, which the AI can process and retrieve efficiently.&lt;br&gt;
We will use a "recursive character chunking" strategy, which is fast and smart: it tries to split at natural boundaries like paragraphs, but can still cut off mid-sentence if the chunk is too big. An overlap is used to ensure that context isn't lost at the edges of a cut if a split does occur.&lt;/p&gt;

&lt;p&gt;Snippet of code used for splitting &amp;amp; chunking using LangChain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.document_loaders&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DirectoryLoader&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_text_splitters&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;

&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DirectoryLoader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;DATA_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;glob&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./**/*.pdf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;loader_cls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PyPDFLoader&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb9dmvjp63cx85mif5kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frb9dmvjp63cx85mif5kw.png" alt="Recursive character chunking" width="394" height="665"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recursive character chunking is the successor to "fixed-size chunking", which is just a fixed sliding window. With fixed-size chunking, you always need the overlap because you never know how much of which sentence you're cutting off.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11wlflm1ejdaamj0his0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11wlflm1ejdaamj0his0.png" alt="Fixed-size chunking" width="392" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the vector representation of 'Life'? 
&lt;/h2&gt;

&lt;p&gt;The embedding process transforms text chunks into vectors, which are mathematical arrays of floating-point numbers that capture semantic meaning. However, higher dimensionality won't necessarily generate better results. For simple or straightforward documents, expanding the vector size often introduces latency and computational overhead without providing better search accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two sides of the same coin: Indexing &amp;amp; retrieval
&lt;/h3&gt;

&lt;p&gt;Indexing and retrieval are two parts of the same conversation, and you must use the same embedding model for both. This is important because every embedding model puts emphasis on words in a sentence differently. One might prioritize the subject, while another might prioritize the action, which would yield different results.&lt;/p&gt;

&lt;p&gt;What happens when different embedding models try to embed “To be, or not to be…”:&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/iQULEW2JwHE"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Selecting the appropriate embedding type is also very important. The documents that you embed and index are usually long, structured documents where the focus is on the information each &lt;em&gt;provides&lt;/em&gt;. This is in contrast to user queries, which are usually short, messy text, so the retrieval process focuses on the information the user is &lt;em&gt;looking for&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding a match
&lt;/h2&gt;

&lt;p&gt;Once your user query is embedded, the RAG system performs a similarity search against the vector database to identify the most relevant answers. In most vector databases, this is calculated using cosine similarity. This metric focuses exclusively on the angle between vectors rather than their magnitude; it measures how closely the semantic "intent" (angle) of the query aligns with the document, regardless of the text's length or word frequency. This is important because it means the AI can recognize that a short question and a long technical manual can share the same intent even if their scale (magnitude) is completely different.&lt;/p&gt;
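Here's a small plain-Python check of that idea: scaling a vector (standing in for a much longer document) leaves its cosine similarity to a query unchanged, because only the angle matters:

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: dot product normalized by both magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.2, 0.9, 0.1]
document = [2.0, 9.0, 1.0]  # same direction, 10x the magnitude

# Identical angle means identical cosine similarity, regardless of scale
print(math.isclose(cosine_similarity(query, document), 1.0))  # True
```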

&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Link to my GitHub repository → &lt;a href="https://github.com/Neutrollized/rag-systems-crawl-walk-run/tree/main/01_crawl" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To handle HR questions, I'm building an agent using &lt;a href="https://google.github.io/adk-docs/get-started/python/" rel="noopener noreferrer"&gt;Google's Agent Development Kit (ADK)&lt;/a&gt; that connects directly to this RAG system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;query_hr&lt;/span&gt;
&lt;span class="n"&gt;hr_rag_tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FunctionTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query_hr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;hr_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hr_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Specialist in company HR policies and procedures.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a professional HR assistant. Your goal is to answer questions &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;using ONLY the information retrieved from the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query_hr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; tool. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;When calling the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;query_hr&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; tool, ensure all string arguments are properly formatted as standard JSON strings with double quotes.&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RULES:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1. If the tool returns relevant information, summarize it clearly.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2. You MUST cite your sources using the format: (Source: [Source Name], Page: [Page Number]).&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3. If the tool results do not contain the answer, state: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;I&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m sorry, I couldn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t find that in our HR documents.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4. Do not use outside knowledge or make up facts about company policy.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;query_hr&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By giving the agent clear instructions and the right tools to search our vector database, it should be able to pull precise answers for users in seconds:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pu768fxnl6o3x7eycx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pu768fxnl6o3x7eycx9.png" alt="HR RAG ADK Agent w/Gemini 3.1 Pro Preview" width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It answers questions reliably, but if I'm being honest, I can't help but feel we're only scratching the surface of the "full" answers that we're looking for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;We have a working prototype, but there's still plenty of room to grow. To transform this from a simple RAG system into a high-performance engine, our next steps will focus on precision: we'll refine how we process and chunk documents and introduce a reranking layer over our search results to significantly boost the quality of the agent's responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional learning
&lt;/h3&gt;

&lt;p&gt;If you haven't used the Agent Development Kit yet but would like to learn more, check out this Codelab: "&lt;a href="https://codelabs.developers.google.com/onramp/instructions#0?utm_campaign=CDR_0xe7f5807a_default_b479282946&amp;amp;utm_medium=external&amp;amp;utm_source=blog" rel="noopener noreferrer"&gt;ADK Crash Course - From Beginner to Expert&lt;/a&gt;" (it comes with a link to claim some free GCP credits to get you through the course).&lt;/p&gt;

</description>
      <category>rag</category>
      <category>genai</category>
      <category>opensource</category>
      <category>adk</category>
    </item>
  </channel>
</rss>
