Danny Chan for MongoDB Builders


🚀 Tutorial: Vector Search over Financial PDF Reports with MongoDB (Local Embeddings)

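Step 1: Create a database cluster
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add a network access allowlist entry
Step 5: Add a database access user
Step 6: Connect to a local Atlas deployment or an Atlas cluster
Step 7: Extract text from the PDF
Step 8: Embed the PDF text locally, then insert it into MongoDB
Step 9: Check the documents in the collection
Step 10: Create a vector search index
Step 11: Query through the vector search index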



Step 1: Create a database cluster




Step 2: Enter the database cluster information




Step 3: Wait for the cluster to deploy




Step 4: Add a network access allowlist entry




Step 5: Add a database access user




Step 6: Connect to a local Atlas deployment or an Atlas cluster
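A connection string like the one used later in this tutorial can be assembled in code. The sketch below is one way to do it, assuming the `mongodb+srv` format shown in the embedding script; `build_atlas_uri` and the sample credentials are hypothetical and only for illustration. Percent-escaping the username and password with `urllib.parse.quote_plus` keeps characters such as `@` or `:` from breaking the URI.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-escaped credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Hypothetical credentials; the real ones are the database user created in Step 5.
uri = build_atlas_uri("report_user", "p@ss:word/1", "internal-knowledge-base.xxxxx.mongodb.net")
print(uri)
```

The resulting string can be passed to `MongoClient(uri)`. Calling `client.admin.command("ping")` afterwards is a quick way to confirm that the connection works.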




Step 7: Extract text from the PDF
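The article does not name the tool used to produce the text file for this step. As a rough sketch, assuming the third-party `pypdf` package (it is not in this tutorial's install list) and a hypothetical input path, the text could be extracted page by page and saved like this. `extract_pdf_text` and `save_text` are illustrative helper names.

```python
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text of all pages joined by newlines, using pypdf (assumed tool)."""
    from pypdf import PdfReader  # third-party; imported here so save_text works without it

    reader = PdfReader(pdf_path)
    # extract_text() may yield nothing for image-only pages, hence the fallback.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def save_text(text: str, txt_path: str) -> Path:
    """Write the extracted text as UTF-8, creating the folder if needed."""
    path = Path(txt_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf8")
    return path
```

Calling `save_text(extract_pdf_text("./pdf/payment.pdf"), "./txt_final/payment.txt")` would then produce the file that the embedding script reads; the `./pdf/payment.pdf` path is an assumption.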




Step 8: Embed the PDF text locally, then insert it into MongoDB

Install the dependencies:

pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4

Then embed each page locally and insert it into MongoDB:

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings

print("Get documents")
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()

print("Split the text into documents by page")
# The report's site URL is used as the page separator.
# Whitespace-only chunks are skipped so no blank records are embedded.
splits = [page for page in data.split("www.iresearch.com.cn") if page.strip()]

print("Load the local embedding model")
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

print("Connect to your local Atlas deployment or Atlas cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Embed each page and store the vector alongside its text.
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({"text_embedding": embedding, "summary": split})
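To see what the page split does, the sketch below runs the same idea on a tiny made-up string instead of the real report. `split_pages` and the sample text are hypothetical; the only thing carried over from the script is that `www.iresearch.com.cn` separates pages.

```python
PAGE_MARKER = "www.iresearch.com.cn"


def split_pages(text: str, marker: str = PAGE_MARKER) -> list[str]:
    """Split text on the marker and drop chunks that are empty or whitespace-only."""
    chunks = (chunk.strip() for chunk in text.split(marker))
    return [chunk for chunk in chunks if chunk]


# Made-up sample: two pages, each ending with the marker.
sample = "Page one text\nwww.iresearch.com.cn\nPage two text\nwww.iresearch.com.cn\n"
pages = split_pages(sample)
print(len(pages))  # 2
print(pages)       # ['Page one text', 'Page two text']
```

Without the filter, the text after the last marker would become an empty final chunk, which would be embedded and stored as a blank record.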



Step 9: Check the documents in the collection


[Screenshots: PDF page 3 and PDF page 4]

Data structure (the embedding is truncated here; each stored vector has 1024 values)

{
    "_id": "66b79fd22e6781dc9195820f",
    "text_embedding": [0.019098538905382156, -0.0010181389516219497],
    "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}
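When checking the records, one thing worth confirming is that every embedding has the dimension the vector index expects (1024, as set in Step 10). A minimal sketch, where `check_record` is a hypothetical helper and the sample documents are made up:

```python
EXPECTED_DIMS = 1024  # numDimensions in the vector index definition


def check_record(doc: dict, dims: int = EXPECTED_DIMS) -> list[str]:
    """Return a list of problems found in one stored document (empty if none)."""
    problems = []
    embedding = doc.get("text_embedding")
    if not isinstance(embedding, list):
        problems.append("text_embedding is missing or not a list")
    elif len(embedding) != dims:
        problems.append(f"text_embedding has {len(embedding)} values, expected {dims}")
    summary = doc.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary is missing or empty")
    return problems


# Made-up documents: one well-formed, one with a short vector and no text.
good = {"text_embedding": [0.0] * 1024, "summary": "some page text"}
bad = {"text_embedding": [0.019, -0.001], "summary": ""}
print(check_record(good))  # []
print(check_record(bad))
```

Against the live collection, the same function could be applied to each document returned by `collection.find()`.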



Step 10: Create a vector search index


{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
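The same index can also be created from Python instead of the Atlas UI. This is a sketch, assuming PyMongo 4.7 or later (the tutorial pins 4.7.2), where `SearchIndexModel` accepts a `type` argument. The index name `vector_index` matches the `index_name` used in Step 11; `vector_index_definition` and `create_vector_index` are hypothetical helper names.

```python
def vector_index_definition(path: str = "text_embedding", dims: int = 1024) -> dict:
    """Build the index definition shown above as a Python dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dims,
                "similarity": "cosine",
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the vector search index on a PyMongo collection and return its name."""
    from pymongo.operations import SearchIndexModel  # `type` needs pymongo >= 4.7

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)


print(vector_index_definition())
```

Index creation is asynchronous, so the index may take a short while to become queryable after the call returns.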




Step 11: Query through the vector search index

from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint

print("Connect to your local Atlas deployment or Atlas cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Use the same model as in Step 8 so query and document vectors are comparable.
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

# index_name, embedding_key, and text_key must match the index and the stored documents.
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",
    embedding_key="text_embedding",
    text_key="summary",
)

query = "蚂蚁集团"  # "Ant Group", searched in the payment report
results = vector_store.similarity_search(query)
pprint.pprint(results)
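For comparison, the same search can be expressed without the LangChain wrapper as a `$vectorSearch` aggregation stage. This is a sketch under the assumptions of this tutorial (index `vector_index`, vector field `text_embedding`, text field `summary`). `build_vector_search_pipeline` is a hypothetical helper, and the `numCandidates` and `limit` values are illustrative choices, not values from the article.

```python
def build_vector_search_pipeline(
    query_vector: list[float], limit: int = 4, num_candidates: int = 100
) -> list[dict]:
    """Build an aggregation pipeline that returns the closest summaries with scores."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "text_embedding",
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # neighbors considered during the search
                "limit": limit,                   # documents returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Toy 3-value vector just to show the shape; a real query vector has 1024 values.
pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=3)
print(pipeline[0]["$vectorSearch"]["limit"])  # 3
```

With the objects from the query script, this would run as `collection.aggregate(build_vector_search_pipeline(model.embed_query(query)))`.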

Result:

English version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]

Chinese version

[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]





Editors

Danny Chan, specializing in FSI and serverless

Kenny Chan, specializing in FSI and machine learning
