Danny Chan for MongoDB Builders

Posted on Aug 11

🚀 Tutorial: Vector Search over Financial PDF Reports with MongoDB (Local Embeddings)

#mongodb #vectordatabase

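Step 1: Create a database cluster
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add your IP address to the network access list
Step 5: Add a database access user
Step 6: Connect to your local Atlas deployment or Atlas cluster
Step 7: Extract text from the PDF
Step 8: Embed the PDF text locally, then insert it into MongoDB
Step 9: Inspect the documents in the collection
Step 10: Create the vector search index
Step 11: Query through the vector search index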

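Step 1: Create a database cluster

Step 2: Enter the database cluster information

Step 3: Wait for the cluster to deploy

Step 4: Add your IP address to the network access list

Step 5: Add a database access user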

Step 6: Connect to your local Atlas deployment or Atlas cluster
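The connection string used later in this tutorial embeds the username and password directly. As a minimal sketch, assuming credentials may contain characters such as `@` or `:`, the helper below percent-encodes them before building the URI. `build_atlas_uri` is a hypothetical helper name, and the sample credentials are placeholders, not values from this tutorial.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    # quote_plus escapes reserved characters such as '@', ':' and '/'
    # so they cannot be mistaken for URI delimiters.
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Placeholder credentials and the elided host from the tutorial's connection string.
uri = build_atlas_uri("app_user", "p@ss:word", "internal-knowledge-base.xxxxx.mongodb.net")
print(uri)
```

The resulting string can be passed to `MongoClient(...)` in place of a hand-written URI.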

Step 7: Extract text from the PDF
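This step is shown without code. One possible way to produce the text file that step 8 reads is sketched below. It assumes the `pypdf` package (not in the install list of this tutorial) and a hypothetical input path; `extract_pdf_text` and `save_pages` are illustrative helper names, and the original post may have used a different tool.

```python
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> list[str]:
    """Return the extracted text of each page, in page order."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # lazily so the rest of this sketch works without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


def save_pages(pages: list[str], out_path: str) -> None:
    """Write all page texts to a single UTF-8 file, one page after another."""
    path = Path(out_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(pages), encoding="utf8")


# Hypothetical usage; "./pdf/payment.pdf" is an assumed input location:
# save_pages(extract_pdf_text("./pdf/payment.pdf"), "./txt_final/payment.txt")
```

Text extraction quality depends on how the PDF was produced, so it is worth opening the output file and checking a few pages by eye before embedding.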

Step 8: Embed the PDF text locally, then insert it into MongoDB

pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4

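from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings

# Load the text that was extracted from the PDF in step 7.
print("Loading extracted text")
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()

# The site URL appears once per page, so splitting on it yields roughly one chunk per page.
# Empty chunks (for example after a trailing marker) are dropped.
print("Splitting text into page-level chunks")
splits = [s.strip() for s in data.split("www.iresearch.com.cn") if s.strip()]

# The model runs locally; bge-large-zh-v1.5 produces 1024-dimensional vectors.
print("Loading embedding model")
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

print("Connecting to your local Atlas deployment or Atlas cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Embed each chunk and store the vector alongside its source text.
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({"text_embedding": embedding, "summary": split})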
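As a possible refactoring of the loop above, the sketch below separates chunk preparation from the database write, so the splitting logic can be checked without a model or a cluster. `prepare_documents` and `batched` are hypothetical helpers, and the stand-in embedding function exists only to make the example runnable; in the tutorial the real function would be `model.embed_query`.

```python
from typing import Callable, Iterator


def prepare_documents(
    text: str, marker: str, embed: Callable[[str], list[float]]
) -> list[dict]:
    """Split text on marker, drop empty chunks, and pair each chunk with its vector."""
    chunks = [chunk.strip() for chunk in text.split(marker)]
    return [
        {"text_embedding": embed(chunk), "summary": chunk}
        for chunk in chunks
        if chunk  # a trailing marker would otherwise produce an empty chunk
    ]


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_embed(chunk: str) -> list[float]:
    """Stand-in for model.embed_query; returns a one-element vector."""
    return [float(len(chunk))]


sample = "page one www.iresearch.com.cn page two www.iresearch.com.cn "
docs = prepare_documents(sample, "www.iresearch.com.cn", fake_embed)
print(len(docs))

# With the real objects from this step (batch size of 100 is an arbitrary choice):
# docs = prepare_documents(data, "www.iresearch.com.cn", model.embed_query)
# for batch in batched(docs, 100):
#     collection.insert_many(batch)
```

Using `insert_many` on batches reduces the number of round trips compared with one `insert_one` call per page.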

Step 9: Inspect the documents in the collection

PDF page 3

PDF page 4

Data structure

{
    "_id": "66b79fd22e6781dc9195820f",
    "text_embedding": [0.019098538905382156, -0.0010181389516219497],
    "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}
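The sample above shows only the first two values of the vector. Before creating the index, it can help to confirm that every stored vector has the length the index expects (1024 in step 10). The check below is a small sketch; `validate_document` is a hypothetical helper, not part of pymongo or LangChain.

```python
def validate_document(doc: dict, expected_dim: int = 1024) -> list[str]:
    """Return a list of problems found in one stored document (empty if it looks fine)."""
    problems = []
    embedding = doc.get("text_embedding")
    if not isinstance(embedding, list) or not all(
        isinstance(value, (int, float)) for value in embedding
    ):
        problems.append("text_embedding must be a list of numbers")
    elif len(embedding) != expected_dim:
        problems.append(
            f"text_embedding has {len(embedding)} values, expected {expected_dim}"
        )
    summary = doc.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary must be a non-empty string")
    return problems


# Hypothetical usage against the collection from step 8:
# for doc in collection.find().limit(10):
#     print(doc["_id"], validate_document(doc))
```

A vector whose length differs from the index's `numDimensions` is a common reason for a document not appearing in search results, so this check is cheap insurance.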

Step 10: Create the vector search index

{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
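The definition above can be pasted into the Atlas UI. As an alternative sketch, assuming pymongo 4.7 or later (where `SearchIndexModel` accepts a `type` argument), the same index could be created from Python. `vector_index_definition` and `create_vector_index` are hypothetical helper names, and the index name `vector_index` is taken from the query code in step 11.

```python
def vector_index_definition(
    path: str, num_dimensions: int, similarity: str = "cosine"
) -> dict:
    """Build an Atlas Vector Search index definition for a single vector field."""
    if similarity not in {"cosine", "dotProduct", "euclidean"}:
        raise ValueError(f"unsupported similarity: {similarity}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on a pymongo collection and return its name."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition("text_embedding", 1024),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)


print(vector_index_definition("text_embedding", 1024))
```

Index creation is asynchronous, so queries may return nothing until the index has finished building.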

Step 11: Query through the vector search index

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint

print("Connecting to your local Atlas deployment or Atlas cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Use the same model as in step 8 so query and document vectors are comparable.
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

# index_name must match the name given to the index created in step 10.
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",
    embedding_key="text_embedding",
    text_key="summary",
)

query = "蚂蚁集团"  # "Ant Group"
results = vector_store.similarity_search(query)
pprint.pprint(results)
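The LangChain wrapper hides the underlying aggregation. For readers who want to query with pymongo alone, the sketch below builds a `$vectorSearch` pipeline using the field and index names from this tutorial. `vector_search_pipeline` is a hypothetical helper, and the `num_candidates` default of 100 is an arbitrary assumption rather than a recommendation from the post.

```python
def vector_search_pipeline(
    query_vector: list[float],
    index_name: str = "vector_index",
    path: str = "text_embedding",
    limit: int = 4,
    num_candidates: int = 100,
) -> list[dict]:
    """Build an aggregation pipeline that runs an Atlas $vectorSearch query."""
    if num_candidates < limit:
        raise ValueError("num_candidates must be at least as large as limit")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        # Return the text and the similarity score, but not the large vector field.
        {
            "$project": {
                "_id": 1,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# A short stand-in vector; a real query vector would have 1024 values.
pipeline = vector_search_pipeline([0.1, 0.2, 0.3])

# With the real objects from this step:
# query_vector = model.embed_query("蚂蚁集团")
# for doc in collection.aggregate(vector_search_pipeline(query_vector)):
#     print(doc["score"], doc["summary"][:50])
```

The `limit` default of 4 matches the default `k` of `similarity_search` in LangChain.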

Result:

English version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]

Chinese version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]
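The third hit is about tourism rather than payments: a top-k search returns k documents even when some are only weakly related. One way to handle this, sketched below, is to request scores with `similarity_search_with_score` and discard hits under a threshold. `filter_by_score` is a hypothetical helper, and the labels and scores in the demo are invented for illustration only; they are not output from this tutorial's data.

```python
def filter_by_score(scored_results: list[tuple], min_score: float) -> list[tuple]:
    """Keep only (document, score) pairs whose score is at least min_score."""
    return [(doc, score) for doc, score in scored_results if score >= min_score]


# Invented placeholder pairs, standing in for (Document, score) tuples.
demo = [("doc_a", 0.82), ("doc_b", 0.74), ("doc_c", 0.55)]
kept = filter_by_score(demo, min_score=0.7)
print(kept)

# With the real vector store from step 11 (the threshold is an assumption to tune):
# scored = vector_store.similarity_search_with_score("蚂蚁集团", k=4)
# for doc, score in filter_by_score(scored, min_score=0.7):
#     print(score, doc.page_content[:50])
```

A suitable threshold depends on the embedding model and the data, so it is best chosen by inspecting scores for a handful of known-good and known-bad queries.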

