
Danny Chan for MongoDB Builders


🚀 Tutorial: Embedding financial PDF reports locally and querying them with MongoDB vector search

Step 1: Create a database cluster
Step 2: Enter the cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add your IP address to the network access list
Step 5: Add a database user
Step 6: Connect to your local Atlas deployment or Atlas Cluster
Step 7: Extract text from the PDF
Step 8: Embed the PDF text locally and insert it into MongoDB
Step 9: Check the documents in the collection
Step 10: Create a vector search index
Step 11: Query with the vector search index



Step 1: Create a database cluster




Step 2: Enter the cluster information




Step 3: Wait for the cluster to deploy




Step 4: Add your IP address to the network access list




Step 5: Add a database user




Step 6: Connect to your local Atlas deployment or Atlas Cluster

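The connection string used later in this tutorial contains `<username>` and `<password>` placeholders. Here is a minimal sketch of filling them in safely, assuming the credentials may contain characters such as `@` or `:` that need percent-encoding before they go into a URI. The helper name `build_atlas_uri` and the sample credentials are made up for illustration; the host is the placeholder host used in this post.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Return a mongodb+srv URI with percent-encoded credentials."""
    # Characters such as '@', ':' or '/' inside the credentials would
    # otherwise be read as URI delimiters.
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Hypothetical credentials, only to show the encoding.
uri = build_atlas_uri(
    "app_user",
    "p@ss:word/1",
    "internal-knowledge-base.xxxxx.mongodb.net",
)
print(uri)
```

The resulting string can be passed to `MongoClient(...)` in place of the hand-written URI in Steps 8 and 11.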



Step 7: Extract text from the PDF

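The screenshots for this step are not reproduced here, so the tool originally used for extraction is not known. Step 8 only needs a UTF-8 text file at `./txt_final/payment.txt`. One possible way to produce it, assuming the `pypdf` package is installed (it is not in the dependency list of this post) and assuming a hypothetical input path `./pdf/payment.pdf`:

```python
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> list[str]:
    """Return the extracted text of each page, in page order."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # here so that save_pages() can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Image-only pages may yield no text; normalise those to "".
    return [page.extract_text() or "" for page in reader.pages]


def save_pages(pages: list[str], out_path: str) -> None:
    """Write all page texts into one UTF-8 file, separated by newlines."""
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(pages), encoding="utf8")


if __name__ == "__main__":
    # Hypothetical input path; the output path is the file read in Step 8.
    pdf_file = Path("./pdf/payment.pdf")
    if pdf_file.exists():
        save_pages(extract_pdf_text(str(pdf_file)), "./txt_final/payment.txt")
```

Scanned PDFs without a text layer would need OCR instead, which this sketch does not cover.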



Step 8: Embed the PDF text locally and insert it into MongoDB



pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4




from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings




print("get documents")

data = ""
with open("./txt_final/payment.txt","r",encoding="utf8") as file:
    data = file.read()




print("Split txt into documents by page")

# The site URL serves as the page delimiter: splitting on it yields one chunk per page.
splits = data.split("www.iresearch.com.cn")




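print("Load the embedding model")

# The model is downloaded from Hugging Face on first use and then runs locally.
# bge-large-zh-v1.5 produces 1024-dimensional vectors.
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")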




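print("Connect to your local Atlas deployment or Atlas Cluster")
# Replace <username> and <password> with the database user created in Step 5.
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")

# Database "internal-knowledge-base", collection "papers"
collection = mongo_client["internal-knowledge-base"]["papers"]

for split in splits:
    # Splitting on the delimiter can leave blank chunks; skip them.
    if not split.strip():
        continue
    embedding = model.embed_query(split)  # list of floats
    collection.insert_one({'text_embedding': embedding, 'summary': split})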


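The loop above makes one model call and one database round trip per page. For larger reports, batching may be worth trying. The sketch below assumes the `embed_documents` method that LangChain embedding classes provide and PyMongo's `insert_many`; the helper names `batched` and `embed_and_insert` and the batch size are illustrative, not part of the original code.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_and_insert(chunks, model, collection, batch_size: int = 16) -> int:
    """Embed chunks in batches, insert them, and return how many were inserted.

    `model` needs an embed_documents(list[str]) method and `collection`
    needs insert_many(list[dict]); both are assumed, duck-typed interfaces.
    """
    inserted = 0
    for batch in batched(chunks, batch_size):
        vectors = model.embed_documents(list(batch))
        # Same document shape as the single-insert loop above.
        docs = [
            {"text_embedding": vector, "summary": text}
            for text, vector in zip(batch, vectors)
        ]
        collection.insert_many(docs)
        inserted += len(docs)
    return inserted
```

With the objects from this step, the call would be `embed_and_insert([s for s in splits if s.strip()], model, collection)`.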



Step 9: Check the documents in the collection


[Screenshots: PDF page 3 and PDF page 4]

Data structure



{
    "_id": "66b79fd22e6781dc9195820f",
    "text_embedding": [0.019098538905382156, -0.0010181389516219497],
    "summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}


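The sample above shows only two values of `text_embedding`; the vectors stored by Step 8 are much longer (1024 values for `BAAI/bge-large-zh-v1.5`). Before indexing the field, it can help to confirm that every stored vector has the same length. A small sketch with a hypothetical helper, `embedding_lengths`, run here on stand-in documents:

```python
def embedding_lengths(docs, field: str = "text_embedding") -> set[int]:
    """Return the distinct vector lengths found in an iterable of documents."""
    return {len(doc[field]) for doc in docs}


# Stand-in documents; with PyMongo one could pass
# collection.find({}, {"text_embedding": 1}) instead.
sample_docs = [
    {"text_embedding": [0.1, 0.2, 0.3]},
    {"text_embedding": [0.4, 0.5, 0.6]},
]
# A set with a single value means all vectors have the same length.
print(embedding_lengths(sample_docs))
```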



Step 10: Create a vector search index




{
  "fields": [
    {
      "type": "vector",
      "path": "text_embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}


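The index definition above is pasted into the Atlas UI. As a rough alternative, the same definition could be created from Python, assuming PyMongo 4.7 or later (where `SearchIndexModel` accepts a `type` argument) and a cluster that allows creating search indexes through the driver. The helper names are illustrative; `vector_index` is the index name used in Step 11.

```python
def vector_index_definition(path: str, num_dimensions: int,
                            similarity: str = "cosine") -> dict:
    """Build the same definition that is entered in the Atlas UI."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the index on a PyMongo collection and return its name."""
    # Imported here so the definition helper works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition("text_embedding", 1024),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)
```

`numDimensions` has to match the length of the stored vectors, which is 1024 for the model used in this tutorial. The index is built asynchronously, so queries may return nothing until it is ready.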



Step 11: Query with the vector search index



from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint




print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")

collection = mongo_client["internal-knowledge-base"]["papers"]

# Use the same model as in Step 8 so query and document vectors are comparable.
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")

vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=model,
    index_name="vector_index",        # name of the index created in Step 10
    embedding_key="text_embedding",   # field that stores the vector
    text_key="summary"                # field that stores the page text
)




query = "蚂蚁集团"  # "Ant Group"
results = vector_store.similarity_search(query)
pprint.pprint(results)



Result:
English version



[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]



Chinese version



[
    Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团—支付宝生态筑基'),
    Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
    Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]


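The LangChain wrapper hides the underlying query. For readers who want to see it, the search can also be written as a plain `$vectorSearch` aggregation stage. This is a sketch under the field and index names used in this tutorial; the helper name `vector_search_pipeline` and the `limit` and `num_candidates` defaults are illustrative choices, not values from the original post.

```python
def vector_search_pipeline(query_vector, limit: int = 3,
                           num_candidates: int = 100) -> list[dict]:
    """Build a $vectorSearch aggregation pipeline for the `papers` collection."""
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",          # index from Step 10
                "path": "text_embedding",         # indexed vector field
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,  # candidates scanned before ranking
                "limit": limit,                   # documents returned
            }
        },
        {
            # Return only the page text and the similarity score.
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With the objects from this step, the call would look like `collection.aggregate(vector_search_pipeline(model.embed_query(query)))`. Unlike `similarity_search`, this also exposes the score of each match.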





Editors

Danny Chan, specializing in FSI and serverless

Kenny Chan, specializing in FSI and machine learning
