Some Useful Pymongo Snippets

#pymongo #mongodb #python

## Table Of Contents
1. Display all databases available
2. Display all collections available
3. Display one document from collection
4. Display number of documents in the collection
5. Display a specific field of the first 10 documents
6. Find the first 10 author names in ascending alphabetical order
7. Display the number of documents that do not match a regex pattern
8. Display/find number of documents uploaded between dates
9. Find documents by text search
10. Find documents that match a pattern/regex in a field
11. Group by field and count documents, then sort by count
12. Update document if exists otherwise insert new document

I have a papers collection stored in a MongoDB Atlas database. Example document:

{
    '_id': ObjectId('5fa9a4db76fdd8d66273c643'),
    'id': '0704.0001',
    'submitter': 'Pavel Nadolsky',
    'authors': "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
    'title': 'Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies',
    'comments': '37 pages, 15 figures; published version',
    'journal-ref': 'Phys.Rev.D76:013009,2007',
    'doi': '10.1103/PhysRevD.76.013009',
    'report-no': 'ANL-HEP-PR-07-12',
    'categories': 'hep-ph',
    'abstract': '  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative contributions from quark-antiquark,\ngluon-(anti)quark, and gluon-gluon subprocesses are included, as well as\nall-orders resummation of initial-state gluon radiation valid at\nnext-to-next-to-leading logarithmic accuracy. The region of phase space is\nspecified in which the calculation is most reliable. Good agreement is\ndemonstrated with data from the Fermilab Tevatron, and predictions are made for\nmore detailed tests with CDF and DO data. Predictions are shown for\ndistributions of diphoton pairs produced at the energy of the Large Hadron\nCollider (LHC). Distributions of the diphoton pairs from the decay of a Higgs\nboson are contrasted with those produced from QCD processes at the LHC, showing\nthat enhanced sensitivity to the signal can be obtained with judicious\nselection of events.\n',
    'update_date': '2008-11-26',
    'authors_parsed': [
        ['Balázs', 'C.', ''],
        ['Berger', 'E. L.', ''],
        ['Nadolsky', 'P. M.', ''],
        ['Yuan', 'C. -P.', '']
    ]
}

A handful of use cases keep coming up in my database-related projects, so I put together a list of snippets to have on hand when I need them.

Initialization of pymongo:

from pymongo import MongoClient

#Connection to the Database
full_dns_name = 'mongodb://***'
username = 'test'
password = 'test'
authSource = 'admin'

client = MongoClient(host=full_dns_name, username=username, password=password, authSource=authSource)
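
If the credentials go into the connection string itself rather than into the keyword arguments, reserved characters such as `@`, `:` and `/` have to be percent-escaped first. Here is a minimal sketch of that; `build_mongo_uri` is a hypothetical helper name and the host and password are made-up values:

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host, auth_source="admin"):
    """Return a mongodb:// URI with percent-escaped credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"mongodb://{user}:{pwd}@{host}/?authSource={auth_source}"


# Made-up credentials, only to show the escaping
uri = build_mongo_uri("test", "p@ss:word/1", "localhost:27017")
print(uri)
```

The resulting string can be passed as the first argument to `MongoClient`.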

1. Display all databases available:

#display all databases available
db_list = list(client.list_databases())
print(db_list)

#or

for db in client.list_databases():
    print(db)

2. Display all collections available:

#display all collections available
for db in client.list_databases():
    name = db['name']
    for col in client[name].list_collections():
        print(col)

3. Display one document from collection:

#display one document from "Papers" collection
db = client.arxiv
papers_col = db.papers
doc = papers_col.find_one()
print(doc)

4. Display number of documents in the collection:

#display number of documents in the collection
number_of_doc = papers_col.count_documents({})
print(number_of_doc)

5. Display a specific field of the first 10 documents:

# display the titles of 10 articles
articles = list(papers_col.find({}, {'title': 1}).limit(10))
print(articles)

6. Find the first 10 author names in ascending alphabetical order:

from pymongo import ASCENDING

# The "submitter" field holds the author's name.
# Sort by submitter, then keep the first 10. MongoDB applies the sort before
# the limit regardless of the order in which the two calls are chained.
articles = list(
    papers_col.find({}, {'submitter': 1})
    .sort([('submitter', ASCENDING)])
    .limit(10)
)
print(articles)

7. Display the number of documents that do not match a regex pattern:

import re

# Count the articles that were not submitted by "Damien Chablat"
pattern = re.compile(r'Damien Chablat')
articles_count = papers_col.count_documents({'submitter': {'$not': pattern}})
print(articles_count)

8. Display/find number of documents uploaded between dates:

# The "update_date" field holds each document's upload date (yyyy-mm-dd format)
# Display the number of articles uploaded in 2014

from datetime import date

first_date = date(2014, 1, 1).isoformat()
last_date = date(2015, 1, 1).isoformat()

articles_count = papers_col.count_documents({'update_date': {'$gte': first_date, '$lt': last_date}})
print(articles_count)
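
The range query works on plain strings because ISO `yyyy-mm-dd` dates sort the same way lexicographically as they do chronologically. As a rough sketch, the filter for any year could be built like this (`year_range_filter` is a name I made up for illustration):

```python
from datetime import date


def year_range_filter(year, field="update_date"):
    """Build a filter matching ISO date strings that fall inside `year`."""
    start = date(year, 1, 1).isoformat()      # inclusive lower bound
    end = date(year + 1, 1, 1).isoformat()    # exclusive upper bound
    return {field: {"$gte": start, "$lt": end}}


print(year_range_filter(2014))
```

The returned dictionary can be passed straight to `papers_col.count_documents(...)` or `papers_col.find(...)`.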

9. Find documents by text search:

from pymongo import TEXT

# Find articles whose abstract mentions "Machine Learning"
# (a text index on the field is required before $text can be used)
papers_col.create_index([("abstract", TEXT)])
articles = papers_col.find({"$text": {"$search": "Machine Learning"}}, {'abstract': 1})
print(list(articles))
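
One thing to keep in mind: a `$text` search for several words matches documents containing any of them, so the query above also returns abstracts that mention only "machine" or only "learning". Wrapping the phrase in double quotes asks for the exact phrase, and the `textScore` metadata can be used to rank results. A small sketch, where `text_search_query` is a hypothetical helper:

```python
def text_search_query(phrase, exact=True):
    """Return (filter, projection, sort) for a $text search ranked by relevance."""
    # Double quotes inside the search string request an exact phrase match
    search = f'"{phrase}"' if exact else phrase
    query = {"$text": {"$search": search}}
    # Expose the relevance score as a "score" field and sort on it
    projection = {"title": 1, "score": {"$meta": "textScore"}}
    sort = [("score", {"$meta": "textScore"})]
    return query, projection, sort


query, projection, sort = text_search_query("Machine Learning")
print(query)
```

With pymongo this would be used as `papers_col.find(query, projection).sort(sort)`, assuming the text index on `abstract` already exists.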

10. Find documents includes pattern/regex in a field:

import re

# Find articles whose abstract contains "Machine Learning" (case-sensitive)
pattern = re.compile(r'Machine Learning')
articles = papers_col.find({'abstract': {'$regex': pattern}})

print(list(articles))
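
The pattern above is case-sensitive, and any regex metacharacters in the search phrase would be interpreted as such. A possible variation that escapes the phrase and ignores case through the `$options` operator (`contains_filter` is an illustrative name, not part of pymongo):

```python
import re


def contains_filter(field, phrase, ignore_case=True):
    """Build a filter matching documents whose `field` contains `phrase` literally."""
    # re.escape keeps characters like "+" or "(" from being read as regex syntax
    spec = {"$regex": re.escape(phrase)}
    if ignore_case:
        spec["$options"] = "i"
    return {field: spec}


print(contains_filter("abstract", "Machine Learning"))
```

Unanchored regex queries like this one generally cannot use an index efficiently, so on a large collection the text search from the previous snippet may be the better fit.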

11. Group by field and count documents, then sort by count:

# Display the number of publications for the 10 most prolific submitters

pipeline = [
    { "$group": {"_id": "$submitter", "count": {"$sum": 1}} },
    { "$sort": { "count": -1 } },
    { '$limit': 10 }
]

articles = list(papers_col.aggregate(pipeline))
print(articles)
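
The same pipeline shape works for any field, so it can be wrapped in a small builder. This is only a sketch: `top_values_pipeline` is a made-up name, and the secondary sort on `_id` is my own addition to keep the order stable when two submitters have the same count.

```python
def top_values_pipeline(field, limit=10):
    """Build a pipeline that counts documents per value of `field`."""
    return [
        # One output document per distinct value, with its number of occurrences
        {"$group": {"_id": f"${field}", "count": {"$sum": 1}}},
        # Highest counts first; ties broken alphabetically by the grouped value
        {"$sort": {"count": -1, "_id": 1}},
        {"$limit": limit},
    ]


print(top_values_pipeline("submitter"))
```

For example, `papers_col.aggregate(top_values_pipeline("categories", 5))` would list the five most common category strings.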

12. Update document if exists otherwise insert new document:

def update_or_create_paper(paper_data):
    # Update 'data' if a document with this custom 'id' exists, otherwise insert a new one.
    # By default this returns the document as it was before the update (None on insert).
    return papers_col.find_one_and_update({"id": paper_data['id']},
                                          {"$set": {"data": {**paper_data}}},
                                          upsert=True)
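
When the old document is not needed, `update_one` with `upsert=True` is an alternative, and its result tells whether an insert or an update happened. A minimal sketch, assuming `collection` is a pymongo collection such as `papers_col`; note that, unlike the function above, it writes the fields at the top level of the document rather than under a `data` key, and `upsert_paper` is a hypothetical name:

```python
def upsert_paper(collection, paper_data):
    """Insert or update a paper keyed by its custom 'id' field."""
    result = collection.update_one(
        {"id": paper_data["id"]},   # match on the custom id, not on _id
        {"$set": paper_data},       # overwrite only the supplied fields
        upsert=True,
    )
    # upserted_id is None when an existing document was matched
    return "inserted" if result.upserted_id is not None else "updated"
```

Calling `upsert_paper(papers_col, {"id": "0704.0001", "title": "..."})` twice should report an insert the first time and an update the second, provided no document with that `id` existed beforehand.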
