<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mikhail Borodin</title>
    <description>The latest articles on DEV Community by Mikhail Borodin (@mikhailborodin).</description>
    <link>https://dev.to/mikhailborodin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F705833%2Ff7a08c0d-84ab-4259-a264-5b065e229b81.jpeg</url>
      <title>DEV Community: Mikhail Borodin</title>
      <link>https://dev.to/mikhailborodin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mikhailborodin"/>
    <language>en</language>
    <item>
      <title>Running locally DeepSeek-R1 for RAG</title>
      <dc:creator>Mikhail Borodin</dc:creator>
      <pubDate>Fri, 21 Feb 2025 09:45:49 +0000</pubDate>
      <link>https://dev.to/mikhailborodin/running-locally-deepseek-r1-for-rag-1dpl</link>
      <guid>https://dev.to/mikhailborodin/running-locally-deepseek-r1-for-rag-1dpl</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;For the last 2 weeks the internet has been abuzz with the emergence of the new DeepSeek generative model. The biggest surprise was that with similar quality of ChatGPT answers it cost an order of magnitude less to train, this has already affected Nvidia's stock price, which lost about 20% of its value in one go. This model can be used for free either through the web or through a mobile app. But today I would like to highlight another of the possibilities of using this model, or rather the possibility of running it locally. And let's look at it on a small project. Let's imagine that we have documentation and we need to search for information or analyze it. For this purpose we can apply the RAG technology. &lt;/p&gt;

&lt;h3&gt;
  
  
  What is RAG?
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) is an advanced AI technique designed to improve the accuracy and reliability of language models by integrating the search for external information into the response generation process.  Unlike traditional generative models that rely solely on pre-trained knowledge, RAG dynamically searches for relevant information before generating a response, reducing hallucinations and improving fact accuracy.   &lt;/p&gt;

&lt;p&gt;RAG's workflow consists of three key steps.  First, it retrieves relevant documents or data from a knowledge base, which may include structured databases, vector stores, or even real-time APIs.  The retrieved information is then combined with the model's internal knowledge to ensure that the answers are based on relevant and reliable sources.  Finally, the model generates a reasoned answer using both its trained language capabilities and the newly acquired data.   &lt;/p&gt;

&lt;p&gt;This approach offers significant advantages.  By basing answers on external sources, RAG minimizes inaccuracies and ensures that information is up-to-date, making it particularly useful in fast-changing fields such as finance, healthcare, and law.  It is widely used in chatbots, AI systems with advanced search, enterprise knowledge discovery, and AI-driven research assistants where accuracy and factual validity are critical.   &lt;/p&gt;

&lt;h3&gt;
  
  
  What will the architecture of the solution look like?
&lt;/h3&gt;

&lt;p&gt;A knowledge base is a collection of relevant and up-to-date information that serves as the basis for a RAG.  In our case, these are documents stored in the catalog. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjx7gl6bt7c6t5odq1ch.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjx7gl6bt7c6t5odq1ch.jpeg" alt="Image description" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before you start implementing this architecture, the following libraries must be installed (tested on Python 3.11):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;llama-index   
transformers   
torch   
sentence-transformers   
llama-index-llms-ollama 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's how you can upload your documents to LlamaIndex as objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt; 

&lt;span class="n"&gt;loader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;   
    &lt;span class="n"&gt;input_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;input_dir_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;required_exts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   
    &lt;span class="n"&gt;recursive&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;   
&lt;span class="p"&gt;)&lt;/span&gt;   
&lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to build a vector store index, which are a key component of search-enhanced generation (RAG), and so you will use them in almost every application that uses LlamaIndex, either directly or indirectly. &lt;/p&gt;

&lt;p&gt;Vector stores take a list of Node objects and build an index from them  &lt;/p&gt;

&lt;h3&gt;
  
  
  VectorStoreIndex in RAG
&lt;/h3&gt;

&lt;p&gt;In Retrieval-Augmented Generation (RAG), the VectorStoreIndex index is used to store and retrieve vector embeddings of documents.  This allows the system to find relevant information based on semantic similarity rather than exact keyword matches.   &lt;/p&gt;

&lt;p&gt;The process starts with document embedding.  Textual data is converted into vector embeddings using an embedding model.  These embeddings represent the meaning of the text in numerical form, making it easier to compare and find similar content.  Once created, these embeddings are stored in a vector database such as FAISS, Pinecone, Weaviate, Qdrant or Chroma.   &lt;/p&gt;

&lt;p&gt;When a user submits a query, it is also converted into an embedding using the same model.  The system then searches for similar embeddings by comparing the query with the stored vectors using similarity metrics such as cosine similarity or Euclidean distance.  The most relevant documents are retrieved based on their similarity scores.   &lt;/p&gt;

&lt;p&gt;The retrieved documents are passed as context to a large language model (LLM), which uses them to generate a response.  This process ensures that the LLM has access to relevant information, improving accuracy and reducing hallucinations.   &lt;/p&gt;

&lt;p&gt;The use of VectorStoreIndex improves RAG systems by providing semantic search, improving scalability and supporting real-time search.  This ensures that answers are contextually accurate and based on the most relevant information available. &lt;/p&gt;

&lt;p&gt;Building a vector index is easy&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt; 

&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   

&lt;span class="c1"&gt;### Save the index to a file   
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;storage_context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is to download and run the deepseek-R1 model using ollama. To do this, you need to install ollama &lt;a href="https://ollama.com/download" rel="noopener noreferrer"&gt;https://ollama.com/download&lt;/a&gt; and download the deepseek-r1 model.&lt;br&gt;
&lt;a href="https://ollama.com/library/deepseek-r1" rel="noopener noreferrer"&gt;https://ollama.com/library/deepseek-r1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that it is enough to execute the following code to make RAG start working&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StorageContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;load_index_from_storage&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;  
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.llms.ollama&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Ollama&lt;/span&gt;

&lt;span class="c1"&gt;# rebuild storage context  
&lt;/span&gt;&lt;span class="n"&gt;storage_context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;StorageContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_defaults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;persist_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# load index  
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_index_from_storage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="c1"&gt;# Creating a prompt template  
&lt;/span&gt;&lt;span class="n"&gt;qa_prompt_tmpl_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="n"&gt;below&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;__&lt;/span&gt;\&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_str&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;__in&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="n"&gt;Given&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="n"&gt;information&lt;/span&gt; &lt;span class="n"&gt;above&lt;/span&gt; &lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;want&lt;/span&gt; &lt;span class="n"&gt;you&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;think&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="n"&gt;by&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;crisp&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cause you don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;know&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="n"&gt;say&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;”  
    “&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="n"&gt;I&lt;/span&gt; &lt;span class="n"&gt;don&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t know!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; \&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query_str&lt;/span&gt;&lt;span class="p"&gt;}{&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;  
    &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;qa_prompt_tmpl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qa_prompt_tmpl_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Setting up a query engine  
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Ollama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;deepseek&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;360.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  

&lt;span class="c1"&gt;# Setup a query engine on the index previously created  
&lt;/span&gt;&lt;span class="n"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="c1"&gt;# specifying the llm to be used  
&lt;/span&gt;&lt;span class="n"&gt;query_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_query_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;  
    &lt;span class="n"&gt;similarity_top_k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;  
&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_prompts&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;response_synthesizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;text_qa_template&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;qa_prompt_tmpl&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;query_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What is the ACID?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final step. Interface
&lt;/h2&gt;

&lt;p&gt;We can embed this solution into a desktop client by creating a user interface using &lt;a href="https://streamlit.io" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt; to allow the user to interact with our RAG application via chat. Or you can make a Telegram bot, then it will be even easier to create an interface&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
