Alain Airom
Understanding Retrieval-Augmented Generation: A Deep Dive into Abhinav Kimothi’s Comprehensive Guide

A book review of “A Simple Guide to Retrieval Augmented Generation”

Transparency Disclosure: As always, I want to clarify that I maintain no financial ties, professional affiliations, or promotional agreements with Manning Publications or the author of this book. My reviews are entirely independent and self-driven. That said, I am a long-standing admirer of Manning’s library; their commitment to quality and technical depth consistently makes (most of… I insist) their titles ‘must-reads’ for anyone looking to level up their skills. This synthesis is born purely out of my appreciation for their work and a desire to share valuable insights with my readers.

All images provided are from Manning editions and the author’s GitHub repository.

Introduction

Although I have been deeply immersed in the practical implementation and testing of RAG architectures for some time — working across a diverse ecosystem of vector databases including (most recently) AstraDB, as well as Milvus, Elasticsearch, Qdrant, ChromaDB, and even SQLite — I never hesitate to invest in a new title. My reasoning is twofold: first, I recognize that in a field this vast, no one truly ‘knows it all’ (myself included 🫠), and there is always room to refine one’s perspective. Second, in a landscape defined by such rapid technological shifts, I’ve found that there are almost always hidden gems — unique insights or novel optimizations — tucked away in well-crafted books and articles.

What follows is a synthesis of the book, which is well written and nicely structured.


Understanding Retrieval-Augmented Generation: A Deep Dive into Abhinav Kimothi’s Comprehensive Guide

Large Language Models (LLMs) are revolutionary, but they aren’t perfect. They are prone to “hallucinations” — confident but factually incorrect responses — and they lack access to real-time or proprietary data. In his book, A Simple Guide to Retrieval Augmented Generation, Abhinav Kimothi provides a roadmap for solving these issues using Retrieval-Augmented Generation (RAG).

RAG allows an LLM to retrieve relevant, factual information from external sources before generating a response, ensuring the output is grounded in reality.


The Two Pillars of RAG: Indexing and Generation

A RAG system is built on two primary pipelines that work in tandem to turn static data into actionable knowledge.

1. The Indexing Pipeline (The Foundation)

The indexing pipeline is typically an offline process where you build your “knowledge base”. Think of this as the library that the AI will consult later. It involves four key steps:

  • Data Loading: Sourcing and cleaning data from various formats like PDFs, web pages, or databases.
  • Chunking: Breaking long documents into smaller, manageable pieces to fit within the “context window” of an LLM.
  • Embeddings: Converting these text chunks into numerical vectors (mathematical representations) that capture their semantic meaning.
  • Vector Storage: Saving these embeddings in a specialized vector database (like FAISS or Pinecone) for fast searching.
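The four indexing steps can be sketched end to end in a few lines of plain Python. This is a toy illustration only (not code from the book): `embed()` is a bag-of-letters stand-in for a real embedding model, and a plain list stands in for the vector database.

```python
# Toy sketch of the indexing pipeline: load -> chunk -> embed -> store.
# embed() is a bag-of-letters stand-in for a real embedding model.

def chunk(text: str, size: int = 50) -> list[str]:
    # Chunking: fixed-size character chunks
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(piece: str) -> list[float]:
    # Embeddings: 26 letter frequencies as a crude "semantic" vector
    vec = [0.0] * 26
    for ch in piece.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

# Data loading: here the "document" is just an in-memory string
document = "Australia won the 2023 Cricket World Cup, beating India in the final."

# Vector storage: keep each chunk next to its embedding
vector_store = [(c, embed(c)) for c in chunk(document)]
print(f"Indexed {len(vector_store)} chunks")
```

A production pipeline swaps each stand-in for a real component: a document loader, a tokenizer-aware splitter, an embedding model, and a vector database such as FAISS or Pinecone.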

2. The Generation Pipeline (The Real-Time Brain)

This pipeline handles the actual interaction when a user asks a question.

  • Retrieval: The system searches the vector database for the most relevant chunks of information.
  • Augmentation: The user’s original question is “augmented” with the retrieved information into a single prompt.
  • Generation: The LLM reads the prompt and the provided context to generate a factual response.
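The three generation steps can likewise be sketched with toy components (my illustration, not the book's code): retrieval here is simple word overlap rather than vector search, and the final LLM call is left as a stub.

```python
# Toy sketch of the generation pipeline: retrieve -> augment -> generate.

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Retrieval: rank chunks by shared words with the query (toy scorer)
    q_words = set(query.lower().split())
    return sorted(chunks, key=lambda c: len(q_words & set(c.lower().split())), reverse=True)[:k]

def augment(query: str, context: list[str]) -> str:
    # Augmentation: fold retrieved context and the question into one prompt
    return "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"

chunks = [
    "Australia won the 2023 Cricket World Cup.",
    "The 2023 tournament was hosted by India across ten venues.",
]
question = "Who won the 2023 Cricket World Cup?"
prompt = augment(question, retrieve(question, chunks))
print(prompt)  # Generation: this prompt would now be sent to an LLM
```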

Measuring Success: RAG Evaluation

Building a RAG system is one thing; ensuring it works accurately is another. Kimothi emphasizes the importance of evaluation through frameworks like RAGAs (Retrieval-Augmented Generation Assessment). Evaluation typically focuses on three pillars:

  • Faithfulness: Is the answer derived solely from the retrieved context?
  • Answer Relevance: Does the response actually address the user’s question?
  • Context Precision: How relevant was the information retrieved from the database?
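RAGAS computes these pillars with LLM-based judgments; purely as an illustration of what each one measures, here are naive word-overlap proxies (my simplification, not the RAGAS implementation).

```python
# Naive word-overlap proxies for the three evaluation pillars.
# Illustration only: RAGAS uses LLM-based judgments, not lexical overlap.

def overlap(a: str, b: str) -> float:
    # Fraction of a's words that also appear in b
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa) if wa else 0.0

question = "who won the 2023 cricket world cup"
context = "australia won the 2023 cricket world cup beating india"
answer = "australia won the 2023 cricket world cup"

faithfulness = overlap(answer, context)        # answer grounded in the context?
answer_relevance = overlap(answer, question)   # answer addresses the question?
context_precision = overlap(context, question) # retrieved context on-topic?

print(faithfulness, answer_relevance, context_precision)
```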

From Naïve to Advanced RAG

While “Naïve RAG” follows a simple retrieve-and-generate flow, it can struggle with complex queries. The book explores advanced variants to overcome these hurdles:

  • Multimodal RAG: Handling data beyond just text, such as images and videos.
  • Knowledge Graph RAG: Using structured relationships between data points for better reasoning.
  • Agentic RAG: Using AI agents to autonomously decide how to route queries or which tools to use for the best answer.

Last but not least, I have a strong preference for technical books that bridge the gap between theory and practice with hands-on sample code. This particular title is excellent in that regard, providing comprehensive code samples across four key chapters to help readers test and validate the concepts immediately. To give you a taste of the practical application found in the book, here is an example from the section ‘Indexing Pipeline: Creating a Knowledge Base for RAG-based Applications’.

Chapter 03 - Indexing Pipeline: Creating a Knowledge Base for RAG-based Applications
Welcome to chapter 3 of A Simple Introduction to Retrieval Augmented Generation. This is the first chapter of the book where we use code examples in Python.
In this chapter, we introduce the concepts behind the indexing pipeline that facilitates building a knowledge base for RAG-enabled applications.

The example that we have been following is to ask the question - Who won the 2023 Cricket World Cup? to a Large Language Model. We are fetching context from a Wikipedia Article - https://en.wikipedia.org/wiki/2023_Cricket_World_Cup

We first did this manually with ChatGPT. Now, we will do it programmatically, using a very popular orchestration framework in Python called LangChain - https://www.langchain.com/


Installing Dependencies
All the necessary libraries for running this notebook, along with their versions, can be found in the requirements.txt file in the root directory of this repository.

You should go to the root directory and run the following command to install the libraries

pip install -r requirements.txt
This is the recommended method of installing the dependencies

Alternatively, you can run the command from this notebook too. The relative path may vary

%pip install -r ../../requirements.txt --quiet
Note: you may need to restart the kernel to use updated packages.
1. Loading Data
The first step towards building a knowledge base (or non-parametric memory) of a RAG-enabled system is to source data from its original location. This data may be in the form of Word documents, PDF files, CSV, HTML, etc. Further, the data may be stored in file, block, or object stores; in data lakes or data warehouses; or even in third-party sources that can be accessed via the open internet. This process of sourcing data from its original location is called Data Loading.

Data Loading includes the following four steps:

  • Connection to the source of the data
  • Extraction and parsing of text from the source format
  • Reviewing and updating metadata information
  • Cleaning or transforming the data

Connecting & Parsing an external URL
We will now use LangChain to connect to Wikipedia and extract data from the page about the 2023 Cricket World Cup. For this we will use the AsyncHtmlLoader class from the document_loaders module in the langchain-community package.

Let us load the URL of our example, i.e., the Wikipedia page of the 2023 Cricket World Cup

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# This is the URL of the Wikipedia page on the 2023 Cricket World Cup
url = "https://en.wikipedia.org/wiki/2023_Cricket_World_Cup"

# Import the loader
from langchain_community.document_loaders import AsyncHtmlLoader

# Instantiate the AsyncHtmlLoader object
loader = AsyncHtmlLoader(url)

# Load the extracted information
html_data = loader.load()
To verify the extracted text and the metadata, let us print the first 1,000 characters

import textwrap

print(textwrap.fill(f"First 1000 characters of extracted content -\n\n{html_data[0].page_content[:1000]}", width=150))
First 1000 characters of extracted content -  <!DOCTYPE html> <html class="client-nojs vector-feature-language-in-header-enabled vector-feature-
language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-
pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1
vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-
available" lang="en" dir="ltr"> <head> <meta charset="UTF-8"> <title>2023 Cricket World Cup - Wikipedia</title> <script>(function(){var
className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-
disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-
limited-width-content-enabled vec
Metadata Review
print(f"Metadata information - \n\n{html_data[0].metadata}")
Metadata information - 

{'source': 'https://en.wikipedia.org/wiki/2023_Cricket_World_Cup', 'title': '2023 Cricket World Cup - Wikipedia', 'language': 'en'}
We can see that some content has been extracted. Also, some metadata information is present.

Document Transformation
The content is still raw HTML, which is cluttered with markup rather than readable text.

LangChain also provides a number of document transformers for converting between formats.

We will now transform this data into a readable format using the Html2TextTransformer class.

from langchain_community.document_transformers import Html2TextTransformer

# Instantiate the Html2TextTransformer object
html2text = Html2TextTransformer()

# Call transform_documents to convert the HTML to plain text
html_data_transformed = html2text.transform_documents(html_data)
Let us review the extracted content, now transformed by the Html2TextTransformer

print(f"First 1,000 characters of extracted content -\n\n{html_data_transformed[0].page_content[:1000]}")
First 1,000 characters of extracted content -

Jump to content

Main menu

Main menu

move to sidebar hide

Navigation

  * Main page
  * Contents
  * Current events
  * Random article
  * About Wikipedia
  * Contact us

Contribute

  * Help
  * Learn to edit
  * Community portal
  * Recent changes
  * Upload file
  * Special pages

Search

Search

Appearance

  * Donate
  * Create account
  * Log in

Personal tools

  * Donate
  * Create account
  * Log in

Pages for logged out editors learn more

  * Contributions
  * Talk

## Contents

move to sidebar hide

  * (Top)

  * 1 Background

Toggle Background subsection

    * 1.1 Host selection

    * 1.2 COVID-19 pandemic

    * 1.3 Format

    * 1.4 Pakistan's participation

    * 1.5 Prize money

    * 1.6 Marketing

  * 2 Qualification

  * 3 Venues

  * 4 Squads

  * 5 Match officials

  * 6 Warm-up matches

  * 7 Group stage

Toggle Group stage subsection

    * 7.1 Points table

    * 7.2 Results

  * 8 Knockout stage

Toggle Knockout stage subsection

    * 8.1 Semi-finals


Now, we see that we have text in readable English!

Optional: BeautifulSoupTransformer
But you may notice that there's a lot of information, like menu options and header and footer content, that may not be very useful.

Another option is the BeautifulSoupTransformer in LangChain, which allows you to extract specific tags from HTML pages. Let us extract the information contained in 'p' tags.

from langchain_community.document_transformers import BeautifulSoupTransformer

soup_transformer = BeautifulSoupTransformer()

html_data_p_tags = soup_transformer.transform_documents(html_data, tags_to_extract=["p"])
print(textwrap.fill(
    f"First 1,000 characters of extracted content -\n\n{html_data_p_tags[0].page_content[:1000]}", width=100))
First 1,000 characters of extracted content -  The 2023 ICC Men's Cricket World Cup was the 13th
edition of the ICC Men's Cricket World Cup (/wiki/ICC_Men%27s_Cricket_World_Cup) , a quadrennial One
Day International (/wiki/One_Day_International) (ODI) cricket (/wiki/Cricket) tournament organized
by the International Cricket Council (/wiki/International_Cricket_Council) (ICC). It was hosted from
5 October to 19 November 2023 across ten venues in India (/wiki/India) . This was the fourth World
Cup held in India, but the first where India was the sole host. The tournament was contested by ten
national teams, maintaining the same format used in 2019 (/wiki/2019_Cricket_World_Cup) . After six
weeks of round-robin matches, India (/wiki/India_national_cricket_team) , South Africa
(/wiki/South_Africa_national_cricket_team) , Australia (/wiki/Australia_national_cricket_team) , and
New Zealand (/wiki/New_Zealand_national_cricket_team) finished as the top four and qualified for the
knockout stage. In the knockout stage, India and Australia be
Congratulations
With this, we have successfully completed the data loading step of the indexing pipeline. We now move on to the next step: Chunking.

But before that, check out the document loaders and transformers available in LangChain

Document Loaders - https://python.langchain.com/docs/integrations/document_loaders/

Document Transformers - https://python.langchain.com/docs/integrations/document_transformers/

2. Data Splitting (Chunking)
Breaking down long pieces of text into manageable sizes is called Data Splitting or Chunking. This is done for various reasons, such as context window limitations, search complexity, and 'lost in the middle' issues.

Understanding Chunking: What is it?
In cognitive psychology, chunking is defined as the process by which individual pieces of information are bound together into a meaningful whole (https://psycnet.apa.org/record/2003-09163-002), and a chunk is a familiar collection of elementary units. The idea is that chunking is an essential technique through which human beings perceive the world and commit it to memory. The simplest example is how we remember long sequences of digits like phone numbers, credit card numbers, dates or even OTPs. We don't remember the entire sequence; in our minds, we break it down into chunks.

The role of chunking in RAG and the underlying idea is somewhat similar to what it is in real life. Once you’ve extracted and parsed text from the source, instead of committing it all to memory as a single element, you break it down into smaller chunks.

Breaking down long pieces of text into manageable sizes is called Chunking

Understanding Chunking: Why is it necessary?
There are two main benefits of chunking —

It leads to better retrieval of information. If a chunk represents a single idea (or fact), it can be retrieved with more confidence than if there are multiple ideas (or facts) within the same chunk.
It leads to better generation. The retrieved chunk has information that is focussed on the user query and does not have any other text that may confuse the LLM. Therefore, the generation is more accurate and coherent.
Apart from these two benefits, chunking also addresses two limitations of LLMs.

Context Window of LLMs: LLMs, due to the inherent nature of the technology, have a limit on the number of tokens (loosely, words) they can work with at a time. This includes both the number of tokens in the prompt (or the input) and the number of tokens in the completion (or the output). The limit on the total number of tokens that an LLM can process in one go is called the context window size. If we pass an input that is longer than the context window, the LLM simply ignores all text beyond that size. It therefore becomes very important to be careful with the amount of text that is being passed to the LLM.

Lost in the middle problem: Even in those LLMs which have a long context window (Claude 3 by Anthropic has a context window of up to 200,000 tokens), an issue with accurately reading the information has been observed. It has been noticed that accuracy declines dramatically if the relevant information is somewhere in the middle of the prompt. This problem can be addressed by passing only the relevant information to the LLM instead of the entire document.

Fixed Size Chunking
A very common approach is to pre-determine the size of the chunk and the amount of overlap between the chunks. There are several chunking methods that follow a fixed size chunking approach.

Character-Based Chunking: Chunks are created based on a fixed number of characters.

Token-Based Chunking: Chunks are created based on a fixed number of tokens.

Sentence-Based Chunking: Chunks are defined by a fixed number of sentences.

Paragraph-Based Chunking: Chunks are created by dividing the text into a fixed number of paragraphs.
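As a rough sketch of the token-based variant (my illustration, not the notebook's code), whitespace-separated words can stand in for model tokens; a real implementation would count tokens with a tokenizer such as tiktoken.

```python
# Toy token-based chunking with overlap; whitespace "tokens" stand in
# for real model tokens.

def token_chunks(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    tokens = text.split()          # naive whitespace tokenization
    step = chunk_size - overlap    # advance by chunk_size minus the overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

sample = " ".join(f"w{i}" for i in range(20))  # 20 dummy tokens
for c in token_chunks(sample):
    print(c)
```

Each chunk shares its last two tokens with the start of the next one, which is the same chunk_overlap idea used by the fixed-size splitters.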

Let's try Character-Based Chunking.

from langchain.text_splitter import CharacterTextSplitter  # Character-based text splitter from LangChain

text_splitter = CharacterTextSplitter(
    separator="\n",     # The character that should be used to split
    chunk_size=1000,    # Number of characters in each chunk
    chunk_overlap=100,  # Number of overlapping characters between chunks
)

text_chunks = text_splitter.create_documents([html_data_transformed[0].page_content])

# Show the number of chunks created
print(f"The number of chunks created : {len(text_chunks)}")
The number of chunks created : 61
In all, this method created 61 chunks. But what about the overlap? Let us check two consecutive chunks, say, chunk 4 and chunk 5. We will compare the last 200 characters of chunk 4 with the first 200 characters of chunk 5.

text_chunks[4].page_content[-200:]
'22 \n    * N/A\n    * 9\n    * 10\n    * 11\n    * 12\n    * 13\n    * 14\n    * 15\n    * 16\n    * 17\n    * 18\n  * 2023 \n    * 19\n    * N/A\n    * 20\n    * 21\n  \nChallenge League\n  * 2019–2022 Challenge League'
text_chunks[5].page_content[:200]
'* 2023 \n    * 19\n    * N/A\n    * 20\n    * 21\n  \nChallenge League\n  * 2019–2022 Challenge League\n  * A \n    * 2019\n    * 2021 (2022)\n    * 2020 (2022)\n  * B \n    * 2019\n    * 2020 (2022)\n    * 2021 (20'
Now, let's see the size distribution of the chunks that have been created

import matplotlib.pyplot as plt
import numpy as np

data = [len(doc.page_content) for doc in text_chunks]

plt.boxplot(data)
plt.title('Box plot of chunk lengths')
plt.ylabel('Chunk length (characters)')

plt.show()

print(f"The median chunk length is : {round(np.median(data),2)}")
print(f"The average chunk length is : {round(np.mean(data),2)}")
print(f"The minimum chunk length is : {round(np.min(data),2)}")
print(f"The max chunk length is : {round(np.max(data),2)}")
print(f"The 75th percentile chunk length is : {round(np.percentile(data, 75),2)}")
print(f"The 25th percentile chunk length is : {round(np.percentile(data, 25),2)}")

The median chunk length is : 973.0
The average chunk length is : 944.64
The minimum chunk length is : 156
The max chunk length is : 1000
The 75th percentile chunk length is : 986.0
The 25th percentile chunk length is : 940.0
Document Structure-Based Chunking
The aim of chunking is to keep meaningful data together. If we are dealing with data in the form of HTML, Markdown, JSON, or even computer code, it makes more sense to split the data based on its structure rather than on a fixed size. A Markdown file, for example, is organised by headers; code written in a programming language like Python or Java is organised by classes and functions; and HTML, likewise, is organised in headers and sections. For such formats, a specialised chunking approach can be employed.

Examples of structure-based splitting:

Markdown: Split based on headers (e.g., #, ##, ###)
HTML: Split using tags
JSON: Split by object or array elements
Code: Split by functions, classes, or logical blocks
Let's reload our HTML document from the URL.

loader = AsyncHtmlLoader(url)

html_data = loader.load()
To split the HTML text based on tags (e.g., h1, section, table, etc.), LangChain provides HTMLSectionSplitter. It splits the text and adds metadata for each section. Let's take a look.

from langchain_text_splitters import HTMLSectionSplitter

sections_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("table", "Table"),
    # ("div", "Div"),
    # ("img", "Image"),
    ("p", "P"),
]

splitter = HTMLSectionSplitter(sections_to_split_on)

split_content = splitter.split_text(html_data[0].page_content)
The resulting list 'split_content' will contain chunks divided based on the provided HTML tags. Let's look at the first 10 documents.

split_content[:10]
[Document(metadata={'Header 1': '#TITLE#'}, page_content='Jump to content \n \n \n \n \n \n \n \n Main menu \n \n \n \n \n \n Main menu \n move to sidebar \n hide \n \n \n \n\t\tNavigation\n\t \n \n \n Main page \n Contents \n Current events \n Random article \n About Wikipedia \n Contact us \n \n \n \n \n \n\t\tContribute\n\t \n \n \n Help \n Learn to edit \n Community portal \n Recent changes \n Upload file \n Special pages \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Search \n \n \n \n \n \n \n \n \n \n \n \n Search \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Appearance \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Donate \n \n \n Create account \n \n \n Log in \n \n \n \n \n \n \n \n \n Personal tools \n \n \n \n \n \n Donate \n   Create account \n   Log in \n \n \n \n \n \n\t\tPages for logged out editors  learn more \n \n \n \n Contributions \n Talk \n \n \n \n \n \n \n \n \n \n \n \n \n  CentralNotice'),
 Document(metadata={'Header 2': 'Contents'}, page_content="Contents \n move to sidebar \n hide \n \n \n \n \n (Top) \n \n \n \n \n \n 1 \n Background \n \n \n \n \n Toggle Background subsection \n \n \n \n \n \n 1.1 \n Host selection \n \n \n \n \n \n \n \n \n 1.2 \n COVID-19 pandemic \n \n \n \n \n \n \n \n \n 1.3 \n Format \n \n \n \n \n \n \n \n \n 1.4 \n Pakistan's participation \n \n \n \n \n \n \n \n \n 1.5 \n Prize money \n \n \n \n \n \n \n \n \n 1.6 \n Marketing \n \n \n \n \n \n \n \n \n \n \n 2 \n Qualification \n \n \n \n \n \n \n \n \n 3 \n Venues \n \n \n \n \n \n \n \n \n 4 \n Squads \n \n \n \n \n \n \n \n \n 5 \n Match officials \n \n \n \n \n \n \n \n \n 6 \n Warm-up matches \n \n \n \n \n \n \n \n \n 7 \n Group stage \n \n \n \n \n Toggle Group stage subsection \n \n \n \n \n \n 7.1 \n Points table \n \n \n \n \n \n \n \n \n 7.2 \n Results \n \n \n \n \n \n \n \n \n \n \n 8 \n Knockout stage \n \n \n \n \n Toggle Knockout stage subsection \n \n \n \n \n \n 8.1 \n Semi-finals \n \n \n \n \n \n \n \n \n 8.2 \n Final \n \n \n \n \n \n \n \n \n \n \n 9 \n Statistics \n \n \n \n \n Toggle Statistics subsection \n \n \n \n \n \n 9.1 \n Most runs \n \n \n \n \n \n \n \n \n 9.2 \n Most wickets \n \n \n \n \n \n \n \n \n 9.3 \n Team of the tournament \n \n \n \n \n \n \n \n \n \n \n 10 \n Broadcasting \n \n \n \n \n \n \n \n \n 11 \n See also \n \n \n \n \n \n \n \n \n 12 \n References \n \n \n \n \n \n \n \n \n 13 \n External links \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Toggle the table of contents"),
 Document(metadata={'Header 1': '2023 Cricket World Cup'}, page_content="2023 Cricket World Cup \n \n \n \n 33 languages \n \n \n \n \n Afrikaans \n العربية \n অসমীয়া \n বাংলা \n Deutsch \n Español \n Français \n ગુજરાતી \n हिन्दी \n Bahasa Indonesia \n Italiano \n ಕನ್ನಡ \n कॉशुर / کٲشُر \n മലയാളം \n मराठी \n مصرى \n Nederlands \n नेपाली \n 日本語 \n Oʻzbekcha / ўзбекча \n ਪੰਜਾਬੀ \n پنجابی \n Português \n Русский \n ᱥᱟᱱᱛᱟᱲᱤ \n සිංහල \n Simple English \n سنڌي \n தமிழ் \n తెలుగు \n Українська \n اردو \n 中文 \n \n Edit links \n \n \n \n \n \n \n \n \n \n \n \n Article \n Talk \n \n \n \n \n \n English \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Read \n View source \n View history \n \n \n \n \n \n \n \n Tools \n \n \n \n \n \n Tools \n move to sidebar \n hide \n \n \n \n\t\tActions\n\t \n \n \n Read \n View source \n View history \n \n \n \n \n \n\t\tGeneral\n\t \n \n \n What links here \n Related changes \n Upload file \n Permanent link \n Page information \n Cite this page \n Get shortened URL \n Download QR code \n \n \n \n \n \n\t\tPrint/export\n\t \n \n \n Download as PDF \n Printable version \n \n \n \n \n \n\t\tIn other projects\n\t \n \n \n Wikimedia Commons \n Wikidata item \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n Appearance \n move to sidebar \n hide \n \n \n \n \n \n \n \n \n \n \n \n From Wikipedia, the free encyclopedia \n \n \n \n \n 13th edition of the ICC Men's Cricket World Cup"),
 ... (remaining output truncated; the subsequent documents consist largely of MediaWiki CSS and infobox markup rather than article text)]
Information\n\n\nOfficials\nSquads\nStatistics\nVenues\n\n\n\n\n\n\n← 2019 CWC 2027 →\n← 2017 CT 2029 →\n\n\n\n\n\nv\nt\ne"}, page_content='Part of a series on the \n \n 2023 Cricket World Cup  / 2025 ICC Champions Trophy \n \n \n \n \n \n \nCWC:    Category  •    Commons CT:    Category  •    Commons \n \n \n 2023 Cricket World Cup \n \n \n \n \n Background \n \n \n Host selection \n COVID-19 pandemic \n Format \n Pakistan\'s participation \n Prize money \n Marketing \n \n \n \n \n \n \n \n \n Stages \n \n Warm-up matches \n Group stage \n Knockout stage\n \n Semi-finals \n Final \n \n \n \n \n \n \n \n \n \n General Information \n \n \n Officials \n Squads \n Statistics \n Venues \n \n \n \n \n \n \nCWC Qualification Overview \n \n \n \n \n Super League \n \n 2020–2023 Super League \n 2020\n \n AUS v ENG \n NED v ZIM \n IRE v ENG \n \n \n 2020–21\n \n ENG v IND \n IND v AUS \n BAN v NZ \n PAK v SA \n ZIM v PAK \n ZIM v SL \n ENG v SA \n WIN v BAN \n IRE v AFG in UAE \n SL v WIN \n \n \n 2021\n \n SL v BAN \n IRE v NED \n AUS v WIN \n BAN v ZIM \n SL v ENG \n PAK v ENG \n SA v IRE \n IND v SL \n AFG v SL \n ZIM v IRE \n \n \n 2021–22\n \n AFG v PAK \n SA v SL \n ZIM v SL \n WIN v IND \n AFG v BAN \n AUS v PAK \n BAN v SA \n NED v SA \n NED v NZ \n IRE v WIN \n NED v AFG in Qatar \n AFG v IND \n \n \n 2022\n \n NZ v IRE \n WIN v NED \n ENG v NED \n WIN v PAK \n PAK v NED \n AFG v ZIM \n NZ v WIN \n IND v ZIM \n ZIM v AUS \n \n \n 2022–23\n \n NZ v AUS \n SA v IND \n AFG v SL \n NZ v PAK \n SA v AUS \n ENG v SA \n ENG v BAN \n SL v NZ \n NED v ZIM \n \n \n 2023\n BAN v IRE \n \n \n \n \n \n \n \n \n League 2 \n \n 2019–2023 League 2 \n 2019\n \n 1 \n 2 \n 3 \n \n \n 2020\n \n 4 \n 5 \n \n \n 2021\n \n 6 \n 7 \n 8 \n \n \n 2022\n \n N/A \n 9 \n 10 \n 11 \n 12 \n 13 \n 14 \n 15 \n 16 \n 17 \n 18 \n \n \n 2023\n \n 19 \n N/A \n 20 \n 21 \n \n \n \n \n \n \n \n \n \n Challenge League \n \n 2019–2022 Challenge League \n A\n \n 2019 \n 2021 (2022) \n 2020 (2022) \n \n \n 
B\n \n 2019 \n 2020 (2022) \n 2021 (2022) \n \n \n \n \n \n \n \n \n \n CWC Qualifier \n \n \n 2023 Qualifier Play-off \n 2023 Qualifier \n \n \n \n \n \n \n 2025 ICC Champions Trophy \n \n \n \n \n Background \n \n \n Host selection \n Format \n India\'s participation \n Prize money \n Marketing \n \n \n \n \n \n \n \n \n Stages \n \n Warm-up matches \n Group stage\n \n Group A \n Group B \n \n \n Knockout stage\n \n Semi-finals \n Final \n \n \n \n \n \n \n \n \n \n General Information \n \n \n Officials \n Squads \n Statistics \n Venues \n \n \n \n \n \n \n ← 2019  CWC  2027 → \n ← 2017  CT  2029 → \n \n \n \n .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:"[ "}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:" ]"}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}html.skin-theme-clientpref-night .mw-parser-output .navbar li a abbr{color:var(--color-base)!important}@media(prefers-color-scheme:dark){html.skin-theme-clientpref-os .mw-parser-output .navbar li a abbr{color:var(--color-base)!important}}@media print{.mw-parser-output .navbar{display:none!important}} \n \n v \n t \n e'),
 Document(metadata={'P': "The 2023 ICC Men's Cricket World Cup was the 13th edition of the ICC Men's Cricket World Cup, a quadrennial One Day International (ODI) cricket tournament organized by the International Cricket Council (ICC). It was hosted from 5 October to 19 November 2023 across ten venues in India. This was the fourth World Cup held in India, but the first where India was the sole host."}, page_content="The  2023 ICC Men's Cricket World Cup  was the 13th edition of the  ICC Men's Cricket World Cup , a quadrennial  One Day International  (ODI)  cricket  tournament organized by the  International Cricket Council  (ICC). It was hosted from 5 October to 19 November 2023 across ten venues in  India . This was the fourth World Cup held in India, but the first where India was the sole host."),
 Document(metadata={'P': 'The tournament was contested by ten national teams, maintaining the same format used in 2019. After six weeks of round-robin matches, India, South Africa, Australia, and New Zealand finished as the top four and qualified for the knockout stage. In the knockout stage, India and Australia beat New Zealand and South Africa, respectively, to advance to the final, played on 19 November at the Narendra Modi Stadium in Ahmedabad. Australia won the final by six wickets, winning their sixth Cricket World Cup title.'}, page_content='The tournament was contested by ten national teams, maintaining the same format used in  2019 . After six weeks of round-robin matches,  India ,  South Africa ,  Australia , and  New Zealand  finished as the top four and qualified for the knockout stage. In the knockout stage, India and Australia beat New Zealand and South Africa, respectively, to advance to the final, played on 19 November at the  Narendra Modi Stadium  in  Ahmedabad . Australia won the final by six wickets, winning their sixth Cricket World Cup title.'),
 Document(metadata={'P': 'Virat Kohli was named the player of the tournament and also scored the most runs, while Mohammed Shami was the leading wicket-taker. A total of 1,250,307 spectators attended the matches, the highest number in any Cricket World Cup to date.[1] The tournament final set viewership records in India, drawing 518 million viewers, with a peak of 57 million streaming viewers.'}, page_content='Virat Kohli  was named the player of the tournament and also scored the most runs, while  Mohammed Shami  was the leading wicket-taker. A total of 1,250,307 spectators attended the matches, the highest number in any Cricket World Cup to date. [ 1 ]  The tournament final set viewership records in India, drawing 518 million viewers, with a peak of 57 million streaming viewers.'),
 Document(metadata={'Header 2': 'Background'}, page_content='Background \n Host selection')]
We can see the metadata indicating the section tag of the chunk. So how many chunks were created?

len(split_content)
231
Let's see how many chunks were created for each section type.

from collections import Counter

class_counter = Counter()

for doc in split_content:
    document_class = next(iter(doc.metadata.keys()))
    class_counter[document_class] += 1

print(class_counter)
Counter({'Table': 189, 'P': 26, 'Header 2': 14, 'Header 1': 2})
Now, let us look at the lengths of these chunks

import matplotlib.pyplot as plt
import numpy as np

data = [len(doc.page_content) for doc in split_content]

plt.boxplot(data)
plt.title('Box Plot of chunk lengths')
plt.xlabel('Chunk Lengths')
plt.ylabel('Values')

plt.show()

print(f"The median chunk length is : {round(np.median(data),2)}")
print(f"The average chunk length is : {round(np.mean(data),2)}")
print(f"The minimum chunk length is : {round(np.min(data),2)}")
print(f"The max chunk length is : {round(np.max(data),2)}")
print(f"The 75th percentile chunk length is : {round(np.percentile(data, 75),2)}")
print(f"The 25th percentile chunk length is : {round(np.percentile(data, 25),2)}")

The median chunk length is : 71.0
The average chunk length is : 326.65
The minimum chunk length is : 12
The max chunk length is : 13740
The 75th percentile chunk length is : 200.0
The 25th percentile chunk length is : 37.0
Some chunks are far longer than 1,000 characters (the largest is 13,740). Let's try to control that.
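The idea behind recursive character splitting can be sketched in plain Python: try the coarsest separator first, and recurse with finer separators on any piece that is still too long. This is a simplified illustration only, without the piece-merging and chunk overlap that LangChain's splitter adds:

```python
def recursive_split(text, separators, chunk_size):
    """Simplified recursive character splitting (no overlap, no merging)."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too long: retry with the next, finer separator
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

doc = "First paragraph.\n\nSecond paragraph. It has two sentences.\n\nThird."
chunks = recursive_split(doc, ["\n\n", "\n", "."], chunk_size=30)
print(chunks)
```

Every resulting piece respects the size limit, and paragraph boundaries are preserved wherever possible; the real splitter additionally re-merges small pieces and adds the configured overlap.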

from langchain.text_splitter import RecursiveCharacterTextSplitter  # another character-based text splitter from LangChain

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "."],  # separators to try recursively, coarsest first
    chunk_size=1000,                 # maximum number of characters in each chunk
    chunk_overlap=100,               # number of overlapping characters between chunks
)

final_chunks=text_splitter.split_documents(split_content)

#Show the number of chunks created
print(f"The number of chunks created : {len(final_chunks)}")
The number of chunks created : 285
data = [len(doc.page_content) for doc in final_chunks]

plt.boxplot(data)  
plt.title('Box Plot of chunk lengths')  # Title
plt.xlabel('Chunk Lengths')  # Label for x-axis
plt.ylabel('Values')  # Label for y-axis

plt.show()

print(f"The median chunk length is : {round(np.median(data),2)}")
print(f"The average chunk length is : {round(np.mean(data),2)}")
print(f"The minimum chunk length is : {round(np.min(data),2)}")
print(f"The max chunk length is : {round(np.max(data),2)}")
print(f"The 75th percentile chunk length is : {round(np.percentile(data, 75),2)}")
print(f"The 25th percentile chunk length is : {round(np.percentile(data, 25),2)}")

The median chunk length is : 75.0
The average chunk length is : 272.27
The minimum chunk length is : 6
The max chunk length is : 1000
The 75th percentile chunk length is : 377.0
The 25th percentile chunk length is : 50.0

Congratulations
With this, you have successfully completed the chunking of the data. We now move on to the next step: creating Embeddings.

But before that, check out the splitters available in LangChain

Text Splitters - [https://python.langchain.com/docs/concepts/text_splitters/]

Data Conversion (Embeddings)
Computers, at the very core, do mathematical calculations. Mathematical calculations are done on numbers. Therefore, for a computer to process any kind of non-numeric data like text or image, it must be first converted into a numerical form.

Embeddings are a technique that is extremely helpful across data science, machine learning, and artificial intelligence. Embeddings are vector representations of data: data that has been transformed into points in an n-dimensional numerical space. A word embedding, for example, is a vector representation of a word.
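The payoff of representing text as vectors is that semantic similarity becomes a simple geometric measure, typically cosine similarity. A minimal sketch with NumPy, using tiny hand-picked toy vectors (real embedding models produce hundreds of dimensions):

```python
import numpy as np

# Toy 4-dimensional "embeddings" for illustration only;
# real models like all-mpnet-base-v2 output 768 dimensions.
emb = {
    "cat": np.array([0.9, 0.8, 0.1, 0.0]),
    "dog": np.array([0.8, 0.9, 0.2, 0.1]),
    "car": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine(a, b):
    """Cosine similarity: dot product of the unit-normalized vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["cat"], emb["dog"]))  # high: related concepts
print(cosine(emb["cat"], emb["car"]))  # low: unrelated concepts
```

Related concepts ("cat", "dog") end up close together in the vector space, while unrelated ones ("cat", "car") end up far apart; that geometric closeness is what retrieval will exploit later.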

(Figure 3.6 from the book's repository)

Open Source Embeddings from HuggingFace
Let's begin with an open-source embedding model from Hugging Face!

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

hf_embeddings = embeddings.embed_documents([chunk.page_content for chunk in final_chunks])
print(f"The length of the embeddings vector is {len(hf_embeddings[0])}")
print(f"The embeddings object is an array of {len(hf_embeddings)} X {len(hf_embeddings[0])}")
The length of the embeddings vector is 768
The embeddings object is an array of 285 X 768
OpenAI Embeddings
OpenAI, the company behind ChatGPT and the GPT series of Large Language Models, also provides three embedding models.

text-embedding-ada-002 was released in December 2022. It has a dimension of 1536 meaning that it converts text into a vector of 1536 dimensions.
text-embedding-3-small is the newer small embedding model of 1536 dimensions, released in January 2024. The flexibility it provides over the ada-002 model is that users can shorten the embedding dimensions according to their needs.
text-embedding-3-large is a large embedding model of 3072 dimensions, released together with text-embedding-3-small. It is the best-performing embedding model OpenAI has released to date.
OpenAI embedding models are proprietary, are accessed through the OpenAI API, and are priced by the number of input tokens for which embeddings are generated.
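As I understand it from OpenAI's documentation, the adjustable dimensions of the text-embedding-3 models work by dropping trailing dimensions of the full vector and re-normalizing to unit length. A NumPy sketch of that idea, using a random vector standing in for a real embedding:

```python
import numpy as np

def shorten(vec, dim):
    """Truncate an embedding to `dim` dimensions and re-normalize to unit length
    (a sketch of the dimension-shortening idea; not OpenAI's actual code)."""
    v = np.asarray(vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

# Random stand-in for a 1536-dimensional text-embedding-3-small vector
full = np.random.default_rng(0).normal(size=1536)
short = shorten(full, 512)
print(short.shape)  # (512,)
```

Shorter vectors trade some accuracy for cheaper storage and faster similarity search, which is why the `dimensions` parameter is useful at scale.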

Note: You will need an OpenAI API Key which can be obtained from OpenAI

To initialize the OpenAI client, we need to pass the api key. There are many ways of doing it.

[Option 1] Creating a .env file for storing the API key and using it # Recommended
Install the dotenv library

The dotenv library is a popular tool used in various programming languages, including Python and Node.js, to manage environment variables in development and deployment environments. It allows developers to load environment variables from a .env file into their application's environment.

Create a file named .env in the root directory of your project.
Inside the .env file, define environment variables in the format VARIABLE_NAME=value.
e.g.

OPENAI_API_KEY=YOUR API KEY

from dotenv import load_dotenv
import os

if load_dotenv():
    print("Success: .env file found with some environment variables")
else:
    print("Caution: No environment variables found. Please create .env file in the root directory or add environment variables in the .env file")
Success: .env file found with some environment variables
[Option 2] Alternatively, you can set the API key in code.
However, this is not recommended since it can leave your key exposed for potential misuse. Uncomment the cell below to use this method.

# import os
# os.environ["OPENAI_API_KEY"] = "sk-proj-******" #Imp : Replace with an OpenAI API Key
We can also test if the key is valid or not

import os
import openai
from openai import OpenAI

api_key = os.environ.get("OPENAI_API_KEY")

if api_key:
    client = OpenAI()
    try:
        client.models.list()
        print("OPENAI_API_KEY is set and is valid")
    # Catch the more specific errors before the generic APIError,
    # since they are subclasses of it
    except openai.APIConnectionError as e:
        print(f"Failed to connect to OpenAI API: {e}")
    except openai.RateLimitError as e:
        print(f"OpenAI API request exceeded rate limit: {e}")
    except openai.APIError as e:
        print(f"OpenAI API returned an API Error: {e}")
else:
    print("Please set your OpenAI API key as an environment variable OPENAI_API_KEY")
OPENAI_API_KEY is set and is valid
Now we will use the OpenAIEmbeddings class from LangChain

# Import OpenAIEmbeddings from the library
from langchain_openai import OpenAIEmbeddings


os.environ["TOKENIZERS_PARALLELISM"]="false"

# Instantiate the embeddings object
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create embeddings for all chunks
openai_embeddings = embeddings.embed_documents([chunk.page_content for chunk in final_chunks])
print(f"The length of the embeddings vector is {len(openai_embeddings[0])}")
print(f"The embeddings object is an array of {len(openai_embeddings)} X {len(openai_embeddings[0])}")
The length of the embeddings vector is 1536
The embeddings object is an array of 285 X 1536
Congratulations
With this, we have successfully completed the creation of embeddings. We now move on to the next step: storing the embeddings in a Vector Store.

Read more about Embedding Models Here

Storage (Vector Databases)
The data has been loaded, split, and converted into embeddings. To use this information repeatedly, we need to store it so that it can be accessed on demand. Vector Databases are built to handle high-dimensional vectors. These databases specialize in indexing and storing vector embeddings for fast semantic search and retrieval.
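Conceptually, retrieval over stored embeddings is just a nearest-neighbor search: score every stored vector against the query and keep the best matches. A brute-force NumPy sketch (random toy vectors standing in for real embeddings) shows the core operation that libraries like FAISS accelerate with specialized indexes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pretend store of 285 chunk embeddings, normalized to unit length
store = rng.normal(size=(285, 1536))
store /= np.linalg.norm(store, axis=1, keepdims=True)

# A query vector deliberately constructed to be very close to chunk 7
query = store[7] + 0.01 * rng.normal(size=1536)
query /= np.linalg.norm(query)

# Inner product against every stored vector (cosine similarity,
# since everything is unit-normalized), then take the top 4
scores = store @ query
top_k = np.argsort(scores)[::-1][:4]
print(top_k)  # chunk 7 should rank first
```

For 285 vectors this brute-force scan is instant; vector databases exist because at millions of vectors you need approximate indexes (IVF, HNSW, etc.) to keep the same query fast.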

Facebook AI Similarity Search (FAISS)
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_community.vectorstores import FAISS

index = faiss.IndexFlatIP(len(openai_embeddings[0]))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

vector_store.add_documents(documents=final_chunks)
['0807dede-2485-4eb6-8345-58fc1bd59416',
 'ec721e7a-8824-48ac-9b48-b558576174ba',
 'dfbbef55-d344-4ad4-b41d-de60688982fe',
 ...
 '955fc567-bcec-448d-97a1-5252f1d37e8b']
(add_documents returns the IDs of all 285 chunks; the list is truncated here)
We can also save the vector store to disk to use it later!

vector_store.save_local(folder_path="../../Assets/Data",index_name="CWC_index")
Search and Retrieval Example
As a glimpse of what happens next, this Vector Index can be used to retrieve documents that are most relevant to the query.

# Original Question
query = "Who won the World Cup final?"
# Ranking the chunks in descending order of similarity
docs = vector_store.similarity_search(query)
# Printing the top ranked chunk
print(docs[0].page_content)
The tournament was contested by ten national teams, maintaining the same format used in  2019 . After six weeks of round-robin matches,  India ,  South Africa ,  Australia , and  New Zealand  finished as the top four and qualified for the knockout stage. In the knockout stage, India and Australia beat New Zealand and South Africa, respectively, to advance to the final, played on 19 November at the  Narendra Modi Stadium  in  Ahmedabad . Australia won the final by six wickets, winning their sixth Cricket World Cup title.

Final Thoughts

A Simple Guide to Retrieval Augmented Generation serves as a bridge for tech professionals to move from theoretical AI to building production-ready systems. Whether you are a data scientist or a curious developer, mastering RAG is becoming an essential skill for the next generation of AI applications.

As always, thanks for sparing some time and reading this article. 🤗

