Glen Yu for Google Developer Experts

Implementing a RAG system: Run

In the "Crawl" and "Walk" phases, I introduced the basics of RAG and explored ways to optimize the pipeline to increase efficiency and accuracy. Armed with this knowledge, it's time to productionize our learnings.

Run

In the "Crawl" and "Walk" phases, we explored RAG fundamentals using local tools, proving how much document processing and re-ranking impact performance. While you could certainly scale those manual workflows into production, do you really want to manage the infrastructure, data pipelines and scaling hurdles yourself?

Welcome to the "Run" phase. Here we leverage Google Cloud's Vertex AI RAG Engine - a fully managed solution that automates the entire pipeline so you can focus on building, not maintenance.

Vertex AI RAG Engine

Vertex AI RAG Engine is a low-code, fully managed solution for building AI applications on private data. It handles the ingestion, document processing, embedding, retrieval, ranking, and grounding to ensure that the response is highly accurate and relevant.
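As a rough sketch of what standing this up looks like in code (the project ID, region, and bucket path below are placeholders, and the `vertexai` RAG SDK surface changes between versions, so treat the exact names as assumptions rather than a definitive implementation):

```python
# Sketch: create a RAG Engine corpus and import files from GCS.
# PROJECT_ID, LOCATION, and GCS_SOURCE are placeholders, not values
# from this article. Requires the google-cloud-aiplatform package.
PROJECT_ID = "your-project-id"                  # hypothetical
LOCATION = "us-central1"                        # hypothetical
GCS_SOURCE = "gs://your-bucket/docling_docs/"   # hypothetical

def create_and_load_corpus(display_name: str, gcs_uri: str):
    """Create a managed RAG corpus and import pre-processed files from GCS."""
    import vertexai
    from vertexai import rag  # API names vary across SDK versions

    vertexai.init(project=PROJECT_ID, location=LOCATION)
    corpus = rag.create_corpus(display_name=display_name)
    # RAG Engine handles parsing, chunking, and embedding on import
    rag.import_files(corpus.name, paths=[gcs_uri])
    return corpus
```

From here, RAG Engine owns the rest of the pipeline; you only point it at your data.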

Optional document pre-processing with Docling

Though RAG Engine comes with its own parser options, I still opted to pre-process my documents using Docling first and upload them to a Google Cloud Storage bucket for ingestion:

# download_and_extract, upload_folder_to_gcs, GCS_BUCKET and GCS_BUCKET_PATH
# are defined earlier in the script (omitted here for brevity)
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()

if __name__ == "__main__":
    DATA_URL = "https://storage.googleapis.com/public-file-server/genai-downloads/bc_hr_policies.tgz"
    DATA_DIR = download_and_extract(DATA_URL)

    docling_docs = Path("./docling_docs")
    docling_docs.mkdir(parents=True, exist_ok=True)

    print("> Processing PDFs...")
    pdf_files = list(Path(DATA_DIR).glob("*.pdf"))

    for file in pdf_files:
        try:
            # Convert each PDF and export the result as Markdown
            result = converter.convert(file)
            markdown_content = result.document.export_to_markdown()

            with open(docling_docs / f"{file.stem}.md", "w", encoding="utf-8") as f:
                f.write(markdown_content)
        except Exception as e:
            print(f"Error on {file}: {e}")

    print("> Uploading Docling docs to GCS...")
    upload_folder_to_gcs("./docling_docs", GCS_BUCKET, GCS_BUCKET_PATH)

One benefit of doing it this way is that I get more visibility and control over the document processing step and can validate the Docling output (Markdown) against the original PDFs. It also lets me use the Default parsing libraries option, which is free, unlike the LLM parser and Document AI layout parser options, which carry additional cost and setup.

Chunking strategy

I do lose out on the benefits of the hybrid chunking strategy that Docling would have provided (as seen in the "Walk" phase), because chunking here is determined by the layout parser I choose. If I weren't using Docling, the LLM parser is the option I'd gravitate towards.
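If you do lean on RAG Engine's own chunking, the import call accepts a transformation config. A minimal sketch, assuming the `TransformationConfig`/`ChunkingConfig` shapes of the current `vertexai` RAG SDK (names may differ in older versions), with placeholder chunk sizes:

```python
def import_with_chunking(corpus_name: str, gcs_uri: str,
                         chunk_size: int = 512, chunk_overlap: int = 100):
    """Import files into a RAG corpus with an explicit chunking strategy.

    chunk_size/chunk_overlap defaults are illustrative, not tuned values.
    """
    from vertexai import rag  # API names vary across SDK versions

    transformation = rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=chunk_size,        # tokens per chunk
            chunk_overlap=chunk_overlap,  # tokens shared between neighbors
        )
    )
    rag.import_files(corpus_name, paths=[gcs_uri],
                     transformation_config=transformation)
```

Overlap between neighboring chunks helps preserve context that would otherwise be cut at chunk boundaries, which is part of what Docling's hybrid chunking gave us for free in the "Walk" phase.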

Vector database options

When it comes to vector database options, you'll see several choices in the RAG Engine menu (including regional "Preview" features). I chose the RagManagedDb, a RAG Engine-managed vector store backed by Cloud Spanner, because it offers the fastest path from data to insights with the least amount of infrastructure management. While Spanner is typically an enterprise-grade database, RAG Engine allows you to spin it up on the Basic tier. This allocates 100 processing units, which is 10% of a Spanner node, making it perfect for smaller datasets while still giving you the reliability of a managed service without the enterprise-grade cost.

Spanner basic tier

IMPORTANT: Even on basic tier, this will still run you about $65 USD/month, so please remember to delete and clean up this RAG corpus once you're done experimenting with it.

Embedding model and vector DB

For those prioritizing flexibility, the RAG Engine also supports third-party options like Pinecone and Weaviate. These are excellent choices if portability is a requirement, allowing you to maintain a consistent vector store even if you decide to shift parts of your RAG stack to a different cloud provider or platform later on.
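For illustration, pointing a corpus at an existing Pinecone index looks roughly like this. This follows the preview-API shape (`vertexai.preview.rag`), and both the index name and the Secret Manager resource name are placeholders; the exact class and parameter names are assumptions that vary by SDK version:

```python
def create_pinecone_corpus(display_name: str, index_name: str,
                           api_key_secret: str):
    """Sketch: create a RAG corpus backed by an existing Pinecone index.

    api_key_secret is a Secret Manager resource name holding the Pinecone
    API key, e.g. "projects/.../secrets/.../versions/1" (placeholder).
    """
    from vertexai.preview import rag  # preview API; names vary by version

    return rag.create_corpus(
        display_name=display_name,
        vector_db=rag.Pinecone(index_name=index_name,
                               api_key=api_key_secret),
    )
```

The rest of the pipeline (import, retrieval, ranking) is unchanged; only the storage backend differs, which is what makes this a reasonable portability hedge.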

Ranking & grounding included

Once the RAG corpus is created, you can perform some manual testing to validate. When you ask RAG Engine for search results, re-ranking and grounding is done automatically to ensure relevance and correctness:

RAG Engine corpus test
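The same validation can be scripted. A hedged sketch of a retrieval query, assuming the current `vertexai` RAG SDK's `retrieval_query`/`RagRetrievalConfig` shapes (older versions used slightly different parameter names such as `similarity_top_k`):

```python
def query_corpus(corpus_name: str, question: str, top_k: int = 5):
    """Run a semantic retrieval query against a RAG corpus.

    Re-ranking and grounding are handled server-side by RAG Engine;
    top_k=5 is an illustrative default.
    """
    from vertexai import rag  # API names vary across SDK versions

    response = rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=corpus_name)],
        text=question,
        rag_retrieval_config=rag.RagRetrievalConfig(top_k=top_k),
    )
    # response.contexts holds the retrieved chunks with relevance scores
    return response.contexts
```

Each returned context carries its source URI, which is what makes the grounded answers traceable back to the original documents.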

Model Armor

In a production setting (especially if it's going to be public facing), you will want guardrails. I've written about Guardrails with Agent Development Kit in the past; in ADK, guardrails are implemented through callbacks. Model Armor works the same way here and can be used to inspect text as it flows into and out of the LLM/agent. Key capabilities include:

  • Prompt injection & jailbreak detection: Catches attempts to trick the AI into ignoring its instructions
  • Sensitive Data Protection: Natively integrates with Google's Data Loss Prevention (DLP) to scan for various types of sensitive information (PII)
  • Malicious URL detection
  • Responsible AI (RAI) filters: Hate speech, harassment, dangerous, and sexually explicit content

I configured my Model Armor policy template and then invoked it via the Python SDK to determine whether a given piece of text was safe.
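A sketch of that check using the `google-cloud-modelarmor` SDK. The project, location, and template ID are placeholders, and while I believe the `sanitize_user_prompt` call and `FilterMatchState` enum match the current API, treat the exact shapes as assumptions:

```python
def is_prompt_safe(project: str, location: str,
                   template_id: str, prompt: str) -> bool:
    """Return True if the prompt passes the Model Armor template's filters."""
    from google.cloud import modelarmor_v1

    # Model Armor uses regional endpoints
    client = modelarmor_v1.ModelArmorClient(
        client_options={
            "api_endpoint": f"modelarmor.{location}.rep.googleapis.com"
        }
    )
    request = modelarmor_v1.SanitizeUserPromptRequest(
        name=f"projects/{project}/locations/{location}/templates/{template_id}",
        user_prompt_data=modelarmor_v1.DataItem(text=prompt),
    )
    result = client.sanitize_user_prompt(request=request)
    # MATCH_FOUND means at least one configured filter tripped
    return (result.sanitization_result.filter_match_state
            != modelarmor_v1.FilterMatchState.MATCH_FOUND)
```

The same client exposes a symmetric `sanitize_model_response` call for checking output, which is what an after-model check would use.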

Updated example

You can find the code for the "Run" phase → here

Querying the vector database (RAG Engine in this case) is a lot less involved, as I don't have to write as much of the logic to pass the semantic search results to a re-ranker; RAG Engine takes care of all of that for me!

I once again ran the same two benchmark questions as I did in the "Crawl" and "Walk" phases:

HR RAG ADK Agent w/Gemini 3.1 Pro Preview + RAG Engine

I threw in a couple of extra questions to make sure Model Armor wasn't sleeping on the job, but overall I liked the detail and accuracy of the answers I was provided.

HR RAG ADK Agent w/Gemini 3.1 Pro Preview + RAG Engine + Model Armor

NOTE: In my updated example, I only added a before_model_callback, meaning I'm only checking the input prompts and not the response. An after_model_callback should also be implemented so the generated response is scanned as well, preventing the AI from accidentally leaking sensitive internal data it might have retrieved from the RAG corpus (I omitted the output check here simply because I know there's no sensitive data in this particular dataset).
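The callback wiring looks roughly like this. The `prompt_is_safe` helper below is a stand-in placeholder for a real Model Armor call, and the ADK import paths are based on the current `google-adk` package, so treat them as assumptions:

```python
def prompt_is_safe(text: str) -> bool:
    """Placeholder safety check; swap in a real Model Armor sanitize call."""
    return "ignore previous instructions" not in text.lower()

def before_model_callback(callback_context, llm_request):
    """ADK before_model_callback: block unsafe prompts before the LLM sees them.

    Returning an LlmResponse short-circuits the model call; returning None
    lets the request continue unchanged.
    """
    # Extract the text of the most recent message (simplified)
    text = ""
    if llm_request.contents:
        last = llm_request.contents[-1]
        text = "".join(part.text or "" for part in last.parts)

    if not prompt_is_safe(text):
        from google.adk.models import LlmResponse
        from google.genai import types
        return LlmResponse(content=types.Content(
            role="model",
            parts=[types.Part(text="Request blocked by policy.")],
        ))
    return None
```

The callback is attached at agent construction, e.g. `Agent(..., before_model_callback=before_model_callback)`; an output check would follow the same pattern with `after_model_callback` inspecting the response instead.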

Summary

The purpose of this "Crawl, Walk, Run" series was to take you on a journey from managing code to delivering value. In the earlier phases, we deconstructed the mechanics of how RAG works to understand the roles that chunking, embedding, and re-ranking play in the overall system. In this final phase, we saw how Vertex AI RAG Engine and Model Armor streamline those manual components. By offloading infrastructure management and safety logic to Google Cloud's managed services, you can ensure your system is scalable, accurate, and secure from day one.

Next Steps

Currently in private preview is Vector Search 2.0 with RAG; reading through its documentation and features, it looks pretty interesting, so once it becomes GA, I will definitely give it a try.
I'm also looking forward to all the new AI-related announcements that are sure to happen at Google Next!

Additional learning

Interested in finding out more about how to secure your agent by sanitizing input and output? Try out this Model Armor Codelab!
Vertex AI RAG Engine isn't your only option for managed RAG. If you'd like to try a different option that uses Vertex AI Search, might I suggest the Building Agents with Retrieval-Augmented Generation Codelab?

Happy learning!
