Akriti Upadhyay

Steps to Build Chinese Language AI Using DeepSeek and Qdrant

Introduction

As we step into the Chinese New Year of 2024 - the Year of the Dragon - I thought: why not build a Chinese news AI using DeepSeek and Qdrant? Although LLMs keep growing in size and complexity, building an accurate and effective language AI beyond English still poses a range of challenges. DeepSeek LLM is a long-term project aimed at overcoming these inaccuracies in non-English language applications. It excels in Chinese, so here we'll see how it performs at fetching news from a Chinese news dataset. We'll use LlamaIndex, FastEmbed by Qdrant, and the Qdrant vector store to build an application that lets us understand Chinese news with a robust RAG pipeline.

Let’s dive deeper!

DeepSeek LLM: An Open-Source Language Model with Longtermism

DeepSeek LLM is an advanced language model family that comes in Base and Chat variants. It was trained from scratch on a dataset of 2 trillion tokens spanning both English and Chinese. In terms of size, there are two varieties of DeepSeek LLM: one with 7 billion parameters and the other with 67 billion.

The 7B model uses Multi-Head Attention, while the 67B model uses Grouped-Query Attention. Both variants follow the same architecture as Llama 2, an autoregressive transformer decoder model.
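
To make the attention variants concrete, here is a minimal PyTorch sketch of Grouped-Query Attention; the shapes and the 8-query-head/2-KV-head grouping are illustrative assumptions, not DeepSeek's actual configuration. When the number of KV heads equals the number of query heads, this reduces to standard Multi-Head Attention.

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, seq, n_q_heads, head_dim); k, v: (batch, seq, n_kv_heads, head_dim)."""
    group_size = q.shape[2] // k.shape[2]
    # Each KV head serves a group of query heads: repeat K and V along the head axis.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)
    # Move to (batch, heads, seq, head_dim) for the attention matmuls.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return (F.softmax(scores, dim=-1) @ v).transpose(1, 2)

# Example: 8 query heads sharing 2 KV heads (a 4:1 grouping).
q = torch.randn(1, 16, 8, 64)
k = torch.randn(1, 16, 2, 64)
v = torch.randn(1, 16, 2, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 16, 8, 64])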

DeepSeek LLM is a project dedicated to advancing open-source large language models with a long-term perspective. DeepSeek LLM 67B outperforms Llama 2 70B across domains such as reasoning, mathematics, coding, and comprehension, and compared to other models, including GPT-3.5, it excels in Chinese language proficiency.

The alignment pipeline of DeepSeek LLM consists of two stages:

  1. Supervised Fine-Tuning: The 7B model is fine-tuned for 4 epochs with a learning rate of 1e-5, while the 67B model is fine-tuned for 2 epochs with a learning rate of 5e-6. The model's repetition ratio tends to increase as the quantity of math SFT data grows, because the math SFT data occasionally contains similar patterns of reasoning.
  2. Direct Preference Optimization: To address the repetition problem, the model's ability was further enhanced with DPO training, which proved to be an effective method for LLM alignment (a minimal sketch of the DPO objective follows this list). The preference data for DPO training is constructed along the axes of helpfulness and harmlessness.
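
To make the second stage concrete, here is a minimal PyTorch sketch of the published DPO objective; the tensor names are illustrative, and this is the general loss rather than DeepSeek's exact training code.

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed log-probability of a response under the
    trained policy or the frozen reference model; beta scales the implicit KL penalty."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()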

The model, along with all its variants, is available on Hugging Face. To learn more about DeepSeek LLM, visit their paper and GitHub repository.

Qdrant: A High-Performance Vector Database

Qdrant is an open-source vector database and vector similarity search engine written in Rust, engineered to empower the next generation of AI applications with advanced and high-performing vector similarity search technology. Its key features include multilingual support, which enables versatility across various data types, and filters for a wide array of applications.

Qdrant achieves speed and accuracy through a custom modification of the HNSW algorithm for Approximate Nearest Neighbor Search, delivering state-of-the-art search while keeping results precise. Moreover, it supports attaching additional payload to vectors and filtering results based on payload values. With a rich array of supported data types and query conditions, including string matching, numerical ranges, and geo-locations, Qdrant offers real versatility in data management.
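
As an illustration of payload filtering, here is a minimal qdrant-client sketch; the collection name, payload field, and query vector are placeholders chosen for the example.

from qdrant_client import QdrantClient, models

client = QdrantClient(location=":memory:")
client.recreate_collection(
    collection_name="news",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)
# Search restricted to points whose payload says language == "zh".
hits = client.search(
    collection_name="news",
    query_vector=[0.1] * 384,
    query_filter=models.Filter(
        must=[models.FieldCondition(key="language", match=models.MatchValue(value="zh"))]
    ),
    limit=5,
)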

As a cloud-native, horizontally scalable platform, Qdrant handles growing data volumes efficiently, making good use of computational resources through dynamic query planning and payload data indexing. Its applications include semantic text search, recommendations, user behavior analysis, and more, and it offers a production-ready service with a convenient API for storing, searching, and managing vectors along with their payload. For further details, the Qdrant documentation provides a comprehensive guide to installation, usage, tutorials, and examples.

Utilizing FastEmbed for Lightweight Embedding Generation

FastEmbed is a lightweight, fast, and accurate Python library built specifically for embedding generation, with maintenance overseen by Qdrant. It achieves efficiency and speed through the utilization of quantized model weights and ONNX Runtime, which sidesteps the necessity for a PyTorch dependency.

FastEmbed supports data parallelism for encoding large datasets efficiently and is engineered with a CPU-first approach. Its default model, Flag Embedding, leads the MTEB leaderboard and beats OpenAI Ada-002 on accuracy and recall metrics. FastEmbed also supports Jina Embedding and other popular text embedding models; to learn more about the supported models, visit here.
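
As a quick standalone illustration (assuming a recent fastembed release, where TextEmbedding is the entry point), generating embeddings takes only a few lines:

from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
docs = ["Qdrant is a vector database.", "DeepSeek LLM excels at Chinese."]
embeddings = list(model.embed(docs))  # lazy generator -> list of numpy arrays
print(len(embeddings), embeddings[0].shape)  # 2 (384,)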

LlamaIndex Framework for Robust RAG

LlamaIndex is a robust framework well suited to building Retrieval-Augmented Generation (RAG) applications. It facilitates decoupling the chunks used for retrieval from those used for synthesis, a crucial feature because the optimal representation for retrieval may differ from the one for synthesis. As document volumes grow, LlamaIndex supports structured retrieval, which yields more precise results, particularly when a query is relevant to only a subset of documents.

Moreover, LlamaIndex prioritizes performance by offering an array of strategies to make the RAG pipeline more efficient, aiming to improve retrieval and generation accuracy on complex datasets while mitigating hallucinations. It supports many embedding models and integrates with a wide range of large language models. Additionally, it works seamlessly with established platforms such as LangChain, Flask, and Docker, and offers customization options such as seeding tree construction with custom summary prompts.

Understanding Chinese News with DeepSeek

Since DeepSeek LLM excels in Chinese language proficiency, let’s build a Chinese News AI using Retrieval Augmented Generation.

To get started, let’s install all the dependencies.

%pip install -q llama-index transformers datasets
%pip install -q llama-cpp-python
%pip install -q qdrant-client
%pip install -q llama_hub
%pip install -q fastembed


Here, I have used this dataset; it is a multilingual news dataset, from which I picked the Chinese language. Load the dataset and save it to your directory. We'll then use LlamaIndex's SimpleDirectoryReader to read the data.

from datasets import load_dataset
dataset = load_dataset("intfloat/multilingual_cc_news", languages=["zh"], split="train")
dataset.save_to_disk("Notebooks/dataset")
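
One caveat: save_to_disk writes the dataset in Arrow format, which SimpleDirectoryReader is not designed to parse as prose. If the reader struggles with those files, a workaround (a sketch that assumes the dataset exposes a text column) is to dump each article to a plain-text file instead:

import os

os.makedirs("Notebooks/dataset", exist_ok=True)
# Write a subset of articles as .txt files; the "text" column name is an
# assumption about this dataset's schema.
for i, row in enumerate(dataset.select(range(min(1000, len(dataset))))):
    with open(f"Notebooks/dataset/article_{i}.txt", "w", encoding="utf-8") as f:
        f.write(row["text"])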

Now, using LlamaIndex, load the data from the directory where we saved our dataset.

from llama_index import SimpleDirectoryReader
documents = SimpleDirectoryReader("Notebooks/dataset").load_data()

After that, split the documents into small chunks using SentenceSplitter. We keep track of each chunk's source document index so that the document metadata can be injected into the nodes later.

from llama_index.node_parser.text import SentenceSplitter
text_parser = SentenceSplitter(chunk_size=1024)
text_chunks = []
doc_idxs = []
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = text_parser.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

Then, we’ll construct nodes from text chunks manually.

from llama_index.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)

For each node, we’ll generate embeddings using the FastEmbed Embedding model.

from llama_index.embeddings import FastEmbedEmbedding

# Note: bge-small-en-v1.5 is an English embedding model; BAAI also publishes
# a Chinese variant (bge-small-zh-v1.5) that may suit this dataset better.
embed_model = FastEmbedEmbedding(model_name="BAAI/bge-small-en-v1.5")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding
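
Embedding node by node works but can be slow on large corpora. As an alternative (assuming your llama_index version exposes get_text_embedding_batch on the embedding class), you can embed in batches:

# Batched alternative: one call per batch instead of one call per node.
texts = [node.get_content(metadata_mode="all") for node in nodes]
embeddings = embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding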

Now, it's time to load the DeepSeek LLM using HuggingFaceLLM from LlamaIndex. Here, I used the chat model.

import torch
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    # With do_sample=False, decoding is greedy and the temperature value has no effect.
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="deepseek-ai/deepseek-llm-7b-chat",
    model_name="deepseek-ai/deepseek-llm-7b-chat",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)

Then, we'll define the ServiceContext, which consists of the embedding model and the large language model.

from llama_index import ServiceContext
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

After that, we'll create a vector store collection using the Qdrant vector database and create a storage context using the vector store collection.

import qdrant_client
from llama_index import VectorStoreIndex
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

# An in-memory Qdrant instance; point this at a URL for a running server instead.
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
vector_store.add(nodes)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

We’ll pass the documents, storage context, and service context into the VectorStoreIndex.

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, service_context=service_context)
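
Note that from_documents chunks and embeds the documents all over again, even though we already pushed embedded nodes into the store with vector_store.add(nodes). To avoid the duplicate work, one alternative (assuming your llama_index version provides VectorStoreIndex.from_vector_store) is to build the index directly over the populated store:

# Build the index over the already-populated store instead of re-embedding.
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store, service_context=service_context
)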

We’ll generate a query embedding using a query string to build a retrieval pipeline.

query_str = "Can you give me news around IPhone?"
query_embedding = embed_model.get_query_embedding(query_str)

Then, we’ll construct a Vector Store query and query the vector database.

from llama_index.vector_stores import VectorStoreQuery
query_mode = "default"
vector_store_query = VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=2, mode=query_mode)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

The following will be the result:

8%,僅排名第五,位居華為、OPPO和Vivo等本土手機廠商之後----這三家中國手機廠商的市場份額加起來達到47%。
普遍對iPhone 8更期待
除此之外,即將發布的iPhone 7恐怕會成為大獲成功的iPhone 6的"犧牲品"。去年第一季度,得益於iPhone 6銷量激增,蘋果在中國的營收增長了74%。一年後,由於iPhone 6S銷量疲軟,蘋果在全球的iPhone銷量首次出現下滑,而公司營收更是出現13年來的首次滑坡,盡管根據市場研究機構Strategy Analytics的數據,iPhone 6S是今年第二季度全球最暢銷的智能手機。
到目前為止,新浪微博網友對iPhone 7發布的討論,已經超過了去年iPhone 6S發布前的熱度。部分中國用戶甚至已經開始盤算著購買有望於明年發布的iPhone8。鑒於2017年是iPhone上市十周年,外界預計iPhone 8將作出更大的升級。
由於投資者擔心iPhone銷量已過巔峰,蘋果股價在今年始終承受著壓力。盡管今年以來蘋果股價累計上漲了2.35%,但仍然落後於標准普爾500指數的平均漲幅。
市場研究機構Stratechery科技行業分析師本·湯普森(Ben Thompson)說:"相比2014年,今天最大的變化就是iPhone已無處不在。當人們第一次獲得購買iPhone的機會時,它還有巨大的增長空間,但如今那種潛力已經得到充分挖掘。"◎ 陳國雄
美國在1978年立法制定國際銀行法(International Banking Act of 1978),該法將外國銀行業納入與國內銀行相同準則。在此之前,外國銀行設立係依據州法沒有一致性。
1978年制定國際銀行法後,外國銀行設立採雙規制(Dual System),可向聯邦銀行管理機構OCC (Office of the Comptroller of the Currency)或州銀行 (State Banking Department)當局申請,如果向州申請設立毋須經過聯邦銀行同意。到了1991年外國銀行在美迅速成長,大約有280家外國銀行,資產值達6,260億美元,佔美國銀行總資產18%,大部份是依州法設立。

The response is: ‘At 8 percent, it ranked fifth behind local handset makers Huawei, OPPO, and Vivo - the three Chinese handset makers with a combined market share of 47 percent.
Widespread anticipation for iPhone 8
On top of that, the upcoming iPhone 7 is feared to be a "casualty" of the hugely successful iPhone 6. In the first quarter of last year, Apple's China revenue grew 74 percent, thanks to a surge in iPhone 6 sales. A year later, Apple's global iPhone sales fell for the first time due to weak iPhone 6S sales, and the company's revenue slipped for the first time in 13 years, even though the iPhone 6S was the world's best-selling smartphone in the second quarter of this year, according to market researcher Strategy Analytics.
So far, discussions on Sina Weibo about the release of the iPhone 7 have surpassed the buzz surrounding the release of the iPhone 6S last year. Some Chinese users have even begun planning to buy the iPhone 8, which is expected to be released next year, and is expected to get even bigger upgrades given that 2017 marks the 10th anniversary of the iPhone's launch.
As investors are worried that iPhone sales have peaked, Apple shares have been under pressure this year. Although Apple's stock price has risen 2.35% since the beginning of the year, it still lags behind the average rate of increase of the Standard & Poor's 500 Index.
Ben Thompson, a technology industry analyst at market research organization Stratechery, said: "Compared to 2014, the biggest change today is that the iPhone is everywhere. When people first got the chance to buy an iPhone, there was huge room for growth, but today that potential has been fully realized." ◎ Chen Guoxiong
The U.S. enacted the International Banking Act of 1978, which brought foreign banks under the same rules as domestic banks. Before that, foreign banks were established under state law, with no consistency among states.
After the enactment of the International Banking Act of 1978, foreign banks were established under a dual system: they could apply either to the federal regulator, the OCC (Office of the Comptroller of the Currency), or to a State Banking Department, and an application made to a state did not require federal approval. By 1991, foreign banks were growing rapidly in the U.S.: there were about 280 of them, with assets of US$626 billion, accounting for 18% of total U.S. banking assets, most of them established under state law.’

Then we’ll parse the results into a set of nodes.

from llama_index.schema import NodeWithScore
from typing import Optional
nodes_with_scores = []
for index, node in enumerate(query_result.nodes):
    score: Optional[float] = None
    if query_result.similarities is not None:
        score = query_result.similarities[index]
    nodes_with_scores.append(NodeWithScore(node=node, score=score))

Now, using the above, we'll create a retriever class.

from llama_index import QueryBundle
from llama_index.retrievers import BaseRetriever
from typing import Any, List

class VectorDBRetriever(BaseRetriever):
    """Retriever over a qdrant vector store."""
    def __init__(self,
                 vector_store: QdrantVectorStore,
                 embed_model: Any,
                 query_mode: str = "default",
                 similarity_top_k: int = 2) -> None:
        """Init params."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve the top-k nodes for a query."""
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores

retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

Then, create a retriever query engine.

from llama_index.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever, service_context=service_context
)

Finally, our retriever is ready to query and chat with. Let’s pass a query.

The query is: “Tell me about the South China Sea issue.”

query_str = "告诉我南海问题"
response = query_engine.query(query_str)
print(str(response))

The following will be the response:

南海问题是指涉及南海地区多个国家的主权和海洋权益争议的问题。该地区包括南海诸岛及其附近海域,涉及中国、菲律宾、越南、马来西亚、文莱和台湾等国家和地区。南海地区拥有丰富的油气资源,因此争议各方在该地区的领土和资源开发上存在分歧。中国主张对南海诸岛及其附近海域拥有主权,并提出"九段线"主张,而其他国家则对此持有不同看法。南海问题涉及复杂的政治、经济和安全利益,是地区和国际社会关注的焦点之一。

The response is: ‘South China Sea issues refer to issues involving disputes over the sovereignty and maritime rights and interests of multiple countries in the South China Sea region. This area includes the South China Sea Islands and their adjacent waters, involving countries and regions such as China, the Philippines, Vietnam, Malaysia, Brunei and Taiwan. The South China Sea is rich in oil and gas resources, so the parties to the dispute have differences over the territory and resource development in the area. China claims sovereignty over the South China Sea islands and adjacent waters and has proposed a "nine-dash line" claim, while other countries hold different views on this. The South China Sea issue involves complex political, economic, and security interests and is one of the focuses of attention of the regional and international communities.’
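
To see which retrieved chunks grounded the answer, you can inspect the response's source nodes, which LlamaIndex attaches to query-engine responses:

# Print the similarity score and a preview of each chunk behind the answer.
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.get_content()[:200])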

Conclusion

DeepSeek LLM answered the questions well and without difficulty. Its architecture sets it apart from other models, and the most impressive aspect is its use of Direct Preference Optimization to enhance the model's capabilities. It is fine-tuned and optimized for both Chinese and English, and the results show how well such a model can perform.

We utilized FastEmbed for embedding generation and Qdrant for vector similarity search, and retrieval with Qdrant was fast. One of Qdrant's most convenient features is its flexibility of deployment: it runs via Docker, in the cloud, or entirely in memory. Qdrant is versatile in storing vector embeddings, and it was intriguing to experiment with these tools.

Thanks for reading!

References

DeepSeek-AI, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism," https://arxiv.org/pdf/2401.02954.pdf

This article was originally published here
