In this tutorial, I'll show you how to build a backend application using a Large Language Model (LLM) on Azure OpenAI, and introduce you to what's new with DeepSeek's LLM. It's simpler than it might sound!
Important Notes:
I will use the Azure OpenAI cloud service as an example. However, the steps and tips are applicable to any cloud provider you might be using.
The main difference between OpenAI and DeepSeek lies not in the setup but in the performance, so feel free to mentally replace "OpenAI" with "DeepSeek" every time you see it in this blog entry.
Topology
Explanation:
A) Data Ingestion Implementation:
1. Extract & Split text: The document (e.g., PDF or Excel files) is broken into smaller chunks of data.
2. Data Chunk: These chunks contain portions of the document text, ready for embedding.
3. Vector Representation: An embedding model processes each chunk of data, converting them into vector representations.
4. Data Vector Embedding: The data chunks are transformed into embeddings (numeric vectors) that represent the content.
5. Index & Save: The embeddings are stored in a vector database (Vector Store) for later retrieval.
B) Data Retrieval Implementation:
1. User Query: A user submits a query to the system. Key request body parameters to consider are Max Tokens, Temperature, and Top_K (see the sketch after this list).
2. Query Vector Embedding: The embedding model converts the user query into an embedding vector.
3. Similarity Check: The query vector is compared against the stored data vector embeddings to find similar content.
4. Retrieval: The system retrieves the most relevant document chunks based on the similarity check.
5. Relevant Document Chunks: The retrieved relevant chunks of data are prepared for further processing.
6. Prompt: The system combines the user query and the relevant document chunks.
7. Multimodal Model: A multimodal model (like ChatGPT) processes the combined information.
8. Answer: The final answer is generated in JSON format and presented to the user.
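To make the request-body parameters from step 1 of the retrieval flow concrete, here is a minimal, illustrative sketch of what such a payload can look like for an OpenAI-compatible chat endpoint. The values are examples only; note that OpenAI-style APIs (including DeepSeek's) typically expose top_p (nucleus sampling) rather than top_k:

# Illustrative request body for an OpenAI-compatible chat completion call (values are examples only).
request_body = {
    "messages": [{"role": "user", "content": "What is Pinecone in Machine Learning?"}],
    "max_tokens": 512,    # upper bound on the length of the generated answer
    "temperature": 0.2,   # lower values give more deterministic answers
    "top_p": 0.9,         # nucleus sampling; OpenAI-style APIs expose top_p rather than top_k
}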
Tools to use
For this tutorial, the following tools and information will be used:
- Python
- Visual Studio Code
- Azure OpenAI Service
- DeepSeek
- Endpoint URL and API key
Implementation
A) Data Ingestion Implementation:
This step involves processing and storing external knowledge sources (e.g., documents, databases, or files) into a system where the LLM can later retrieve relevant information.
In this tutorial, we will use the Open-source Framework LangChain and Azure OpenAI service. LangChain is a framework for developing applications powered by large language models (LLMs). You can find more information about this framework here: Introduction | 🦜️🔗 LangChain
We will start by creating a new project (or opening an existing one) in VS Code.
Then we will open a Git Bash prompt and enter the following:
python3 -m venv venv
source venv/Scripts/activate
Before implementing the Data Ingestion, you need to install the following dependencies using the pip command within the virtual environment created in the previous step.
pip install langchain langchain-openai langchain-community langchainhub python-dotenv black faiss-cpu tiktoken
Now I will create 4 files for this tutorial:
.env: We will store all environment variables here, as a good security practice. In a commercial environment, these values would be stored in a key vault.
rag-text.txt: We will use a text file to create our Context source for RAG implementation.
rag_ingestion.py: We will create the code here to ingest a demo file into Vector Store.
rag_retrieve.py: We will create the code here to use the LLM and the indexed data to retrieve relevant pieces of information for a specific question.
We will now describe the content of each file:
.env File:
This file will include all credentials necessary to interact with the LLM APIs, including keys, model names, and versions:
# OpenAI Configuration
AZURE_OPENAI_ENDPOINT='https://[URL]/'
AZURE_OPENAI_API_KEY=<API_KEY>
OPENAI_MODEL_NAME_EMBEDDING=<OPENAI_EMBEDDING_MODEL>
OPENAI_API_VERSION_EMBEDDING=<OPENAI_EMBEDDING_API_VERSION>
OPENAI_MODEL_NAME_LLM=<OPENAI_LLM_MODEL> #e.g. GPT-4o
OPENAI_API_VERSION_LLM=<OPENAI_LLM_API_VERSION>
‘rag-text.txt’ file:
For the ‘rag-text.txt’ file, copy and paste the reference data from this URL: rag-text
‘rag_ingestion.py’ file:
I will import the necessary packages and load the .env file. Based on the input data in ‘rag-text.txt’, we will divide the data into chunks, generate vector embeddings, and store them in the faiss_index database.
rag_ingestion.py
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings

load_dotenv()

if __name__ == '__main__':
    print("Ingesting...")
    loader = TextLoader('rag-text.txt', encoding='utf-8')
    document = loader.load()

    print("splitting...")
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(document)
    print(f"created {len(texts)} chunks")

    embeddings = AzureOpenAIEmbeddings(
        model=os.environ["OPENAI_MODEL_NAME_EMBEDDING"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        openai_api_version=os.environ["OPENAI_API_VERSION_EMBEDDING"]  # Use the correct version for the embedding model selected
    )

    print("vector storing...")
    db = FAISS.from_documents(texts, embeddings)
    print(db.index.ntotal)

    # Write our index to disk.
    db.save_local("faiss_index")
    print("Ingestion finished")
Let's discuss in detail each of the steps performed:
1. Extract & Split Text & 2. Data Chunk: The document (e.g., a PDF or Excel file) is broken into smaller chunks of data. I use the LangChain modules TextLoader and CharacterTextSplitter.
Chunk size: A customizable parameter whose value depends on the content type and the desired retrieval quality. I chose 1000 as a well-balanced number for a standard text document, but you can adjust it based on your needs.
Overlap: If you need to ensure that no context is lost between chunks, some overlap (e.g., 50-200 characters) can help. This makes sure that important information at the boundaries of chunks is not missed (see the short sketch after the snippet below).
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
[...]
loader = TextLoader('rag-text.txt', encoding='utf-8')
document = loader.load()
[...]
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(document)
[...]
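As mentioned above, if losing context at chunk boundaries is a concern, you can add some overlap. Here is a minimal sketch with illustrative values (the 100-character overlap is just an example):

from langchain_text_splitters import CharacterTextSplitter

# 1000-character chunks with a 100-character overlap, so text near a chunk
# boundary appears in both neighbouring chunks.
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(document)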
3. Vector Representation & 4. Data Vector Embedding: The code below configures embeddings for an Azure OpenAI model using LangChain. Here's what each part does:
embeddings= AzureOpenAIEmbeddings(...): This initializes an instance of the AzureOpenAIEmbeddings class, which is part of LangChain. This class is responsible for generating vector embeddings from text using an Azure-hosted OpenAI model.
model=os.environ["OPENAI_MODEL_NAME_EMBEDDING"]: This retrieves the name of the embedding model from an environment variable (OPENAI_MODEL_NAME_EMBEDDING). This model will be used to generate the embeddings.
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]: The code pulls the Azure OpenAI endpoint URL from the AZURE_OPENAI_ENDPOINT environment variable. This is the endpoint of the Azure OpenAI service where the API requests are sent.
api_key=os.environ["AZURE_OPENAI_API_KEY"]: This retrieves the API key for authenticating requests to Azure OpenAI from the AZURE_OPENAI_API_KEY environment variable. This key is necessary to access the Azure OpenAI service.
openai_api_version=os.environ["OPENAI_API_VERSION_EMBEDDING"]: This specifies the version of the OpenAI API being used (e.g., "2023-05-15"), ensuring compatibility with the correct API features and models. Use the version that matches the embedding model you selected.
embeddings = AzureOpenAIEmbeddings(
    model=os.environ["OPENAI_MODEL_NAME_EMBEDDING"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ["OPENAI_API_VERSION_EMBEDDING"]  # Use the correct version for the embedding model selected
)
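If you want to see what a vector representation actually looks like, you can embed a single string with the embeddings object above and inspect the result. This is a quick sanity check, not part of the ingestion script; the dimension you see depends on the embedding model you deployed:

# Embed one string and inspect the resulting vector (a plain Python list of floats).
vector = embeddings.embed_query("What is a vector database?")
print(len(vector))   # e.g. 1536 or 3072, depending on the embedding model
print(vector[:5])    # first few components of the embedding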
5. Index & Save:
The last part of the code uses the FAISS module to store the embedded vectors we have generated.
print("vector storing...")
db = FAISS.from_documents(texts, embeddings)
print(db.index.ntotal)
# Write our index to disk.
db.save_local("faiss_index")
print("Ingestion is finish")
To execute, run the following command in the command prompt:
python rag_ingestion.py
(openai) C:\Users>python rag_ingestion.py
Ingesting...
splitting...
Created a chunk of size 1180, which is longer than the specified 1000
Created a chunk of size 1058, which is longer than the specified 1000
created 16 chunks
vector storing...
16
Ingestion finished
B) Data Retrieval Implementation:
I will keep using LangChain for this last part, the Data Retrieval implementation, together with the Azure OpenAI LLM GPT-4o and the embedding model text-embedding-3-large.
‘rag_retrieve.py’ file
Import the necessary packages and load the .env file. Using the vector embeddings stored in the faiss_index database, we will perform a RAG query.
rag_retrieve.py
import os
from dotenv import load_dotenv
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA

load_dotenv()

if __name__ == "__main__":
    print("Retrieving...")

    # Initialize Azure OpenAI embeddings with custom model name and correct API version
    embeddings = AzureOpenAIEmbeddings(
        model=os.environ["OPENAI_MODEL_NAME_EMBEDDING"],  # Use custom embedding model name from environment variable
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        openai_api_version=os.environ["OPENAI_API_VERSION_EMBEDDING"],  # Correct API version for embedding
        openai_api_type="azure",  # Specify the API type
    )

    # Load the saved FAISS store from the disk.
    db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

    llm = AzureChatOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        azure_deployment=os.environ["OPENAI_MODEL_NAME_LLM"],  # Correct deployment model
        api_version=os.environ["OPENAI_API_VERSION_LLM"],  # Correct API version for deployment model
    )

    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=db.as_retriever()
    )

    # Define the query
    query = "What is Pinecone in Machine Learning?"
    result = qa.invoke({"query": query})

    # Print the result
    print(result)
To execute, run the following command:
python rag_retrieve.py
(openai) C:\Users\eptasna\Downloads\Project\2024\GenAI\MNEA_Hackathon\embeddings>python rag_retrieve.py
Retrieving...
{'query': 'What is Pinecone in Machine Learning?', 'result': "Pinecone is a fully managed cloud-based vector database designed to facilitate the building and deployment of large-scale machine learning (ML) applications. It is optimized for storing and retrieving vector embeddings generated by ML models, which represent complex data such as images, text, or audio as numerical vectors. These embeddings capture the essential features of the data, enabling efficient processing and analysis.\n\nKey features of Pinecone in the context of machine learning include:\n\n1. **Scalability**: Pinecone can handle millions or billions of data points, making it suitable for large-scale ML applications.\n2. **Performance**: It offers high query throughput and low latency search, ensuring fast and efficient retrieval of similar data points based on their vector representations.\n3. **Real-time Updates**: Pinecone supports real-time updates, allowing for efficient updates to the vector database as new data points are added.\n4. **Infrastructure Management**: Pinecone provides infrastructure management and maintenance, alleviating the need for users to handle these tasks themselves.\n5. **Security**: It meets the security needs of businesses and organizations.\n6. **User-friendly API**: Pinecone offers a simple API for storing and retrieving vector data, making it easy to integrate into existing ML workflows.\n7. **Integration**: Pinecone can be synced with data from various sources using tools like Airbyte and monitored using Datadog.\n\nOverall, Pinecone's capabilities make it an ideal platform for building and deploying ML applications that require efficient and scalable management of vector data."}
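If you want to inspect which chunks the similarity check actually returns before they are passed to the LLM, you can query the FAISS store directly. The sketch below is an optional debugging aid reusing the db and llm objects from rag_retrieve.py; the value k=3 is illustrative:

# Inspect the top-3 most similar chunks for a query, without calling the LLM.
docs = db.similarity_search("What is Pinecone in Machine Learning?", k=3)
for i, doc in enumerate(docs, start=1):
    print(f"--- chunk {i} ---")
    print(doc.page_content[:200])  # show only the first 200 characters of each chunk

# You can also cap how many chunks the RetrievalQA chain retrieves:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 3})
)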
Making the Shift to DeepSeek
1/27/25 Update
I will update this post with more detailed information, but as of now the easiest way I have found to call the DeepSeek API is described here: DeepSeek
Note:
Instead of the OpenAI embedding model, I will use HuggingFaceEmbeddings for the vector embeddings when working with DeepSeek (see the re-ingestion note after the code below).
rag_retrieve.py
import os
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_openai.chat_models.base import BaseChatOpenAI  # OpenAI-compatible chat model, pointed at DeepSeek
from langchain_community.embeddings import HuggingFaceEmbeddings

load_dotenv()

if __name__ == "__main__":
    print("Retrieving...")

    # Initialize DeepSeek-compatible embeddings (using HuggingFace as an example)
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"  # Replace with a DeepSeek-compatible model if available
    )

    # Load the saved FAISS store from the disk.
    db = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)

    # Replace AzureChatOpenAI with DeepSeek via BaseChatOpenAI
    llm = BaseChatOpenAI(
        model='deepseek-chat',
        openai_api_key=os.environ["DEEPSEEK_API_KEY"],  # DeepSeek API key from .env
        openai_api_base='https://api.deepseek.com/',
        max_tokens=1024
    )

    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=db.as_retriever()
    )

    # Define the query
    query = "What is Pinecone in Machine Learning?"
    result = qa.invoke({"query": query})

    # Print the result
    print(result)
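One caveat: the faiss_index saved earlier was built with the Azure OpenAI embedding model, and a FAISS index can only be queried with the same embedding model it was built with (the vector dimensions must match). If you switch retrieval to HuggingFaceEmbeddings as above, re-run the ingestion with that same model first. Here is a minimal sketch, assuming the same rag-text.txt and that the sentence-transformers package is installed:

import os
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

load_dotenv()

# Rebuild the index with the same embedding model used at retrieval time.
loader = TextLoader('rag-text.txt', encoding='utf-8')
texts = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(loader.load())
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
db = FAISS.from_documents(texts, embeddings)
db.save_local("faiss_index")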
Want to know more about DeepSeek? Check the links below:
Explore DeepSeek Resources
📚 DeepSeek API Documentation
🚀 Dive into the official DeepSeek API docs to get started with integration and advanced features.
🔗 DeepSeek Pricing
💡 Learn about DeepSeek's pricing plans and choose the best option for your needs.