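Step 1: Create a database cluster
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an entry to the network access allowlist
Step 5: Add a database access user
Step 6: Connect to your local Atlas deployment or Atlas cluster
Step 7: Retrieve text from the PDF
Step 8: Embed the PDF text locally and insert it into MongoDB
Step 9: Inspect the documents in the collection
Step 10: Create a vector search index
Step 11: Query through the vector search index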
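Step 1: Create a database cluster
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an entry to the network access allowlist
Step 5: Add a database access user
Step 6: Connect to your local Atlas deployment or Atlas cluster
Step 7: Retrieve text from the PDF
Step 8: Embed the PDF text locally and insert it into MongoDB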
pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings

print("get documents")
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()

print("Split txt into documents by page")
# The site URL is used as the page delimiter.
splits = data.split("www.iresearch.com.cn")

print("get model then embedding")
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]

# Embed each page and store the vector alongside its text.
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({'text_embedding': embedding, 'summary': split})
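The loop above embeds and inserts one page at a time, including any empty pages left over from the split. A minimal sketch of a batched variant follows, assuming the same `www.iresearch.com.cn` page marker. The helpers `split_pages` and `build_documents` and the `embed` callable are hypothetical names, not part of the original script. In the real pipeline, `embed` would be `model.embed_query`, and the resulting list could be passed to `collection.insert_many`.

```python
from typing import Callable, Dict, List


def split_pages(text: str, marker: str) -> List[str]:
    """Split raw text on a page marker and drop empty or whitespace-only pages."""
    return [page.strip() for page in text.split(marker) if page.strip()]


def build_documents(
    pages: List[str], embed: Callable[[str], List[float]]
) -> List[Dict[str, object]]:
    """Pair each page with its embedding, using the field names from the script above."""
    return [{"text_embedding": embed(page), "summary": page} for page in pages]


if __name__ == "__main__":
    # Stand-in embedder for illustration only; swap in model.embed_query.
    def fake_embed(text: str) -> List[float]:
        return [float(len(text))]

    sample = "page one www.iresearch.com.cn page two www.iresearch.com.cn "
    docs = build_documents(split_pages(sample, "www.iresearch.com.cn"), fake_embed)
    print(len(docs))
    # With a live connection: collection.insert_many(docs)
```

Passing the embedder in as a function keeps the splitting and document-building logic testable without downloading the model or connecting to Atlas.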
Step 9: Inspect the documents in the collection
PDF page 3
PDF page 4
Data structure
{
  "_id": "66b79fd22e6781dc9195820f",
  "text_embedding": [0.019098538905382156, -0.0010181389516219497],
  "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}
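Before building an index, it can help to confirm that stored documents match the shape shown above. The following is a minimal sketch, assuming the 1024 dimensions declared for the vector index in Step 10. `check_document` and `EXPECTED_DIMENSIONS` are hypothetical names, not part of the original article.

```python
from typing import Any, Dict, List

# Assumption: matches numDimensions in the vector index definition (Step 10).
EXPECTED_DIMENSIONS = 1024


def check_document(doc: Dict[str, Any], dims: int = EXPECTED_DIMENSIONS) -> List[str]:
    """Return a list of problems found in one stored document (empty if none)."""
    problems = []
    embedding = doc.get("text_embedding")
    if not isinstance(embedding, list):
        problems.append("text_embedding is missing or not a list")
    elif len(embedding) != dims:
        problems.append(f"text_embedding has {len(embedding)} values, expected {dims}")
    summary = doc.get("summary")
    if not isinstance(summary, str) or not summary:
        problems.append("summary is missing or empty")
    return problems
```

It could be applied to a document returned by `collection.find_one()`. An empty list means the document has both fields and an embedding of the expected length.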
Step 10: Create a vector search index
{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
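The same definition can also be created from Python instead of the Atlas UI. This is a sketch under stated assumptions: it relies on `SearchIndexModel` accepting a `type` argument, which requires PyMongo 4.7 or later, and it reuses the index name `vector_index` from the query script in Step 11. The helper names `vector_index_definition` and `create_vector_index` are hypothetical.

```python
from typing import Any, Dict


def vector_index_definition(
    path: str, dims: int, similarity: str = "cosine"
) -> Dict[str, Any]:
    """Build the same index definition as the JSON above."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dims,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on an Atlas collection and return its name."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    index_model = SearchIndexModel(
        definition=vector_index_definition("text_embedding", 1024),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=index_model)
```

Calling `create_vector_index(collection)` with the `collection` object from Step 8 would submit the index. Atlas builds it asynchronously, so queries may return nothing until the build finishes.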
Step 11: Query through the vector search index
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint
print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")
vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",
    embedding_key="text_embedding",
    text_key="summary"
)
query = "蚂蚁集团"  # "Ant Group", a payment-related query
results = vector_store.similarity_search(query)
pprint.pprint(results)
Result:
English version
[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]
Chinese version
[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]
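The third hit in both listings concerns scenic-area tourism rather than payments. A top-k vector search returns the nearest stored vectors even when none is a close match. One way to inspect this is to query with a raw `$vectorSearch` aggregation stage and project the similarity score. The sketch below assumes the `vector_index` index and the `text_embedding` and `summary` fields used earlier. The helper names, the `num_candidates` value, and any score cutoff are illustrative assumptions, not part of the original article.

```python
from typing import Any, Dict, List


def vector_search_pipeline(
    query_vector: List[float], limit: int = 4, num_candidates: int = 100
) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that returns each hit with its score."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "path": "text_embedding",
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # illustrative value
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 1,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def keep_relevant(hits: List[Dict[str, Any]], min_score: float) -> List[Dict[str, Any]]:
    """Drop hits whose similarity score falls below a chosen cutoff."""
    return [hit for hit in hits if hit["score"] >= min_score]
```

With a live connection, `hits = list(collection.aggregate(vector_search_pipeline(model.embed_query(query))))` would return dictionaries that `keep_relevant(hits, min_score)` can filter. A suitable cutoff depends on the embedding model and the data, so it is best chosen by inspecting real scores.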
Editors
Danny Chan, specializing in FSI and Serverless
Kenny Chan, specializing in FSI and Machine Learning