
Store Compressed Files in MongoDB with GridFS

This tutorial is written by Kanin (James) Kearpimy

I once heard about an engineering team trying to use a document database as general file storage. Their main reason: "because we can wrap text and files together as a single piece of content (a document)." There is a kernel of truth in this statement, but it is often bad in practice; let's explore why together.

This article will walk you through the document database mindset behind MongoDB: how it relates to files, and in which scenarios it is suitable to wrap file content inside a document. Get this wrong, and the nightmare of oversized documents with encoded files will come knocking.

Document database and blob storage

Document databases understand the structure of the data they store, so we can perform operations on data values, such as filtering documents by their nested properties.
Figure 1: Document database mental model

The more structure the data has, the more power the database can bring to bear when querying and retrieving it.
Blob storage, in contrast, usually stores data without a hierarchical structure. This is called a key-value mechanism: values hold data without any awareness of its structure, so we cannot perform operations inside them. A JSON payload, for example, is stored as an opaque string. The trade-off is lower overhead to store and retrieve data, at the cost of losing access to nested properties.

Figure 2: Key-value database and document database
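
To make this concrete, here is a minimal sketch (hypothetical database and collection names, assuming a local MongoDB) of why structure matters: a document database can filter on nested properties, while a pure key-value store sees the same JSON only as an opaque string.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo"]["users"]

users.insert_one({"name": "Ada", "address": {"city": "London"}})

# MongoDB understands the document structure, so we can query a nested field.
for doc in users.find({"address.city": "London"}):
    print(doc["name"])

# A key-value store holding the same JSON as a string cannot evaluate this
# filter server-side; the client would have to parse every value itself.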

Which scenario is suitable for each?

Data source separation

Figure 3: Data source separation

Obviously, MongoDB is not a proper solution for overly large files, due to the maximum document size (16MB) and the operational overhead of maintaining nested structure.

Figure 4: Storing large files in MongoDB is costly

The standard approach is to dedicate the file to another source, blob storage, and store only its destination in MongoDB. The system trades network latency for reduced database overhead.

Figure 5: Saving only the file destination
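
As a minimal sketch (field names are illustrative; they match what we will use later in the hands-on), the MongoDB document then holds only a pointer to the file:

# the binary lives in blob storage; the document keeps just the URL
profile_doc = {
    "fullName": "Ada Lovelace",
    "career": "Mathematician",
    "file_url": "https://blob.example.com/avatars/ada.png",  # hypothetical blob URL
}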

MongoDB as single data source

Figure 6: MongoDB as single data source

What if we face a scenario where network latency is a bigger problem than database computational cost? For example, if we limit the file size, MongoDB can properly store a small avatar image or the audio of a name's pronunciation. The system can then reduce its complexity.
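
Here is a minimal sketch of that idea, assuming a local MongoDB and a small avatar.png on disk (all names are illustrative): the binary payload simply lives inside the profile document, well under the 16MB limit.

from pymongo import MongoClient
from bson.binary import Binary

client = MongoClient("mongodb://localhost:27017")
profiles = client["demo"]["profiles"]

# read a small file and embed the bytes directly in the document
with open("avatar.png", "rb") as f:
    profiles.insert_one({
        "fullName": "Ada Lovelace",
        "avatar": Binary(f.read()),  # binary data stored inside the document
    })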

Hands-On

Prerequisites

You should install the below requirements to complete the hands-on tutorial.

  • Python (3.10 is recommended)
  • Python dependencies
    • FastAPI (0.115): Lightweight Python web framework
    • Uvicorn (0.32): ASGI web-server implementation for Python
    • PyMongo (4.10): Python driver for MongoDB database. We utilize this library to perform operations on MongoDB from a Python application.
    • Jinja2 (3.1): Templating to render HTML with Python data from a web server
  • MongoDB Atlas
  • Postman: The API client to interact with a web-server

This tutorial will go through the following steps:

  1. Understand basic system description: Explain the overall system and architecture.
  2. See the project structure: The project file directory for web-server application logic.
  3. Get to know routing: Web-server routing where a user's request is sent.
  4. Create an HTML file and a server in web.py: Explore how index.html interacts with Jinja2 and how web.py, the web-server, works. This file contains the user-interface logic that talks to the backend and interacts with the user via a web browser.
  5. Save file as binary into MongoDB Atlas by GridFS: Go through GridFS, why it matters, and how it works with MongoDB.
  6. Save file into blob storage: This is in case MongoDB is not suitable. For example, for a large file, blob storage is an alternative. This section will explain how we create blob storage.

Understand basic system description

We will see the overall software architecture in Figure 7. This architecture visualizes how a user's request goes to a web-server interacting with MongoDB Atlas and Blob Storage. Each detailed step is described below.

Step 1) When clients go to a website homepage, they will see a form to upload an image (shown in Figure 10). When they upload, the client side—a web browser, for example—sends the user’s input information and image via network to the web-server.
Step 2) The web-server receives a request and then selects where to save the image between MongoDB Atlas (Step 3) and Blob Storage (Step 4). (We’ll see the decision logic to choose storage in the next few sections.)
Step 3) MongoDB Atlas stores the image as binary data.
Step 4) The image is stored as a blob object in Blob Storage.

Figure 7: Hands-on architecture

Let’s see the project structure

In your IDE, this is how your project will be structured. We will go through the set-up and the contents of each file together. At the root directory, there is one file and two directories:

  • app/: This is the main application directory.
    • routers/: All routers, except for root, are stored here.
      • image.py: Handles image rendering requests, such as opening an image URL.
      • profile.py: Saving files from the HTTP request into MongoDB Atlas or Blob Storage.
    • web.py: Web-server routers are managed here.
    • storageService.py: Handles storage operations for both the MongoDB model and blob storage.
    • mongodb.py: Manages MongoDB logic such as connection, save, and find operations.
  • uploads/: Stores uploaded images, simulating blob storage.
  • main.py: The entry point file to run the whole application.

Figure 8: Project structure

Note: __init__.py is an empty file that lets Python treat the directory and all files in it as modules. So, please create __init__.py in every directory (except the root one).

Setting up the dependency

It's a good idea to create a separate environment for projects. We use pyenv to install a specific Python version. (See how to install pyenv).

Install Python version 3.10.9 in your terminal by typing the command below.

pyenv install 3.10.9

Set Python 3.10.9 as the local environment.

pyenv local 3.10.9
python --version

Create a virtual environment.

python -m venv venv-3.10.9
ls

You will see venv-3.10.9.

Activate the virtual environment.

source ./venv-3.10.9/bin/activate

You'll know the virtual environment is activated when there's a (venv-3.10.9) at the beginning of your command line.

Now, you are ready to install the necessary dependencies. Please create requirements.txt in the project root directory and paste the below content in it.

annotated-types==0.7.0
anyio==4.6.2.post1
click==8.1.7
dnspython==2.7.0
exceptiongroup==1.2.2
fastapi==0.115.5
h11==0.14.0
idna==3.10
Jinja2==3.1.4
MarkupSafe==3.0.2
pydantic==2.10.0
pydantic_core==2.27.0
pymongo==4.10.1
python-multipart==0.0.17
sniffio==1.3.1
starlette==0.41.3
typing_extensions==4.12.2
uvicorn==0.32.1

Then, in your terminal, type the below command to install dependencies.

pip install -r requirements.txt

Everything is all set. Let's review the project in detail.

Get to know routing

As we all know, users interact with websites, and the backbone of a website is a server. We are going to create a web-server to handle user requests. All requests reach the server through its router. There are four routes, listed below.

  • GET /: The root route; a health check confirming the server is up.
  • POST /profile: Receives the uploaded image and profile data, then stores the file in MongoDB Atlas or blob storage.
  • GET /image/fs/{file_id}: Renders an image stored in MongoDB Atlas via GridFS.
  • GET /image/blob/{file_id}: Renders an image stored in blob storage (the local uploads/ directory).

Create web-server in web.py

To create a web-server, we create web.py in the app/ directory. First, we import all dependencies and initialize some instances, such as FastAPI, for the web-server implementation.

Our app/web.py:

# import dependencies from FastAPI
from fastapi import FastAPI
from fastapi.responses import JSONResponse


# initiate FastAPI server and bind it to “server” variable
server = FastAPI()


# Comment two below lines. We'll get back to them soon.
# from app.routers import profile
# server.include_router(profile.router, prefix="/profile", tags=["profile"])
# from app.routers import image
# server.include_router(image.router, prefix="/image", tags=["image"])




# declare http route as “/”, typically called root route.
@server.get("/")
def index():
   return JSONResponse(content={"response": {"message": "Server is up!"}})

As you may see, we import several dependencies, including:

  • FastAPI: The web-server instance.
  • JSONResponse: Return responses in JSON format.

We create a root route (/) and return the message "Server is up!" to confirm that everything works. To run the web-server, we add a bit of code to main.py in the root directory. The server variable below refers to the variable we bound to the FastAPI instance in app/web.py.

# In our main.py
from app.web import server

Then, we can run the server. Open the terminal and run the below command. uvicorn will try to invoke the main.py file with the server variable to start the FastAPI web-server.

export HOST=http://127.0.0.1:8000
export MONGO_URI={-- MongoDB Atlas Connection String --}
python -m uvicorn main:server --reload

You should see terminal logging like below.

Figure 9: Running uvicorn in the terminal

Now, the web-server is ready to receive requests. Open Postman and create a GET request to http://127.0.0.1:8000. After sending the request, you should receive a response like the one below.

Figure 10: Postman request and response
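
If you would rather script the check than use Postman, a quick sketch with the requests package works too. (Note: requests is not in our requirements.txt, so install it separately if you want to try this.)

import requests

resp = requests.get("http://127.0.0.1:8000/")
print(resp.status_code)  # expected: 200
print(resp.json())       # expected: {"response": {"message": "Server is up!"}}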

Create profile router to receive files from user’s request to store

We are creating a new router, /profile, to receive images from requests and store them in MongoDB or Blob Storage. Let's create profile.py in app/routers/ and copy the code below.

Our app/routers/profile.py:

from fastapi import Form, File, APIRouter, UploadFile
from fastapi.responses import JSONResponse
import os
from app.storageService import storageService
from app.storageService import mongodb


router = APIRouter()


# HOST_URL is the hostname of the current web-server.
HOST_URL = os.getenv("HOST")


@router.post("/")
async def update_profile(
   profilePicture: UploadFile = File(...),
   fullName: str = Form(...),
   career: str = Form(...),
   id: str = Form(...),
):
   # the response is returned as JSON in both success and error cases.
   try:
       # split filename and extension for storageService
       _, file_extension = os.path.splitext(profilePicture.filename)


       # we pass three parameters including filename, filesize, and file content into storageService.save() method.
       file_id = storageService.save(
           f"{id}{file_extension}", profilePicture.size, profilePicture.file
       )
       file_url = f"{HOST_URL}/image/{file_id}"
       mongodb.save({"fullName": fullName, "career": career, "file_url": file_url})
       result = JSONResponse(content={"response": "OK", "image_url": file_url})
   except Exception as e:
       result = JSONResponse(content={"response": str(e)})


   return result

You may notice that we do not create a new server. Instead, FastAPI's APIRouter treats this file as a sub-router of the whole application. Next, we create the / route to wrap update_profile (when we merge it back into the whole application, we will set this file's prefix to /profile). The update_profile function receives four parameters in the request:

  • profilePicture as file type.
  • fullName as string type.
  • career as string type.
  • id as string type.

Inside the function, we split the filename and file extension and then call the image-saving logic in storageService.save(). If everything works properly, the function returns a response with the image_url of the uploaded image so the client can display it.
Next, create storageService.py in the app/ directory and copy the code below into it.

# MongoDBModel is the MongoDB Model binding type.
import os
import shutil
from app.mongodb import MongoDBModel


class StorageService:


   CHUNK_SIZE = 16 * 1024 * 1024  # 16MB
   DATABASE: MongoDBModel
   UPLOAD_DIR = "uploads"  # Directory to store image file in Blob-like storage.


   def __init__(self, database: MongoDBModel) -> None:
       self.DATABASE = database


   def save(self, filename: str, size: float, file) -> str:
       # the logic checks the file size and decides whether
       # to store the file in MongoDB or blob storage.
       # We limit files stored in MongoDB to 16MB;
       # otherwise, we save them to blob storage.
       if (size / self.CHUNK_SIZE) > 1:
           # the file exceeds CHUNK_SIZE, so it would not fit in a
           # single 16MB BSON document; we go for blob storage.
           # We simulate blob storage with the local disk.
           os.makedirs(self.UPLOAD_DIR, exist_ok=True)  # make sure the directory exists
           file_path = os.path.join(self.UPLOAD_DIR, filename)
           if os.path.exists(file_path):
               os.remove(file_path)
           with open(file_path, "wb") as buffer:
               shutil.copyfileobj(file, buffer)


           return f"blob/{filename}"
       else:
           # the size is at most CHUNK_SIZE: the file fits in a single
           # document, so save it into MongoDB with GridFS.
           name = self.DATABASE.saveFile(file, filename)
           return f"fs/{name}"


   # This is for retrieving image data from the MongoDB database in the service layer.
   def fetch(self, file_id: str):
       return self.DATABASE.fetch(file_id)




# instantiate the MongoDBModel class and the StorageService class.
# storageService is what we created to switch between MongoDB storage and blob storage.
mongodb = MongoDBModel()
storageService = StorageService(mongodb)

We have two functions here: save(), the operation logic to store the file, and fetch(), the retrieval logic to pull the file and return it to the client's request. storageService.py works as a service-layer class that handles everything about file storage. The save() method receives three parameters:

  • Filename
  • File size (in bytes)
  • File data (as binary)

If the file is small (less than one chunk), the system saves it into MongoDB Atlas. Basically, MongoDB documents (BSON) have a 16MB maximum size per document. So size / self.CHUNK_SIZE > 1 means the size divided by the 16MB CHUNK_SIZE comes to more than one piece, which exceeds the document maximum. When that happens, we have two choices.

  1. Utilize GridFS to handle multiple chunks in MongoDB.
  2. Utilize blob storage.
    • We go with this approach in this tutorial because, from our perspective, storing large data in MongoDB is expensive.

So, the function checks the file size to pick the final target between two alternatives for storing the file, as listed below.

  1. Save it as binary in MongoDB Atlas by GridFS.
  2. Save it in Blob Storage.
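
To make the threshold concrete, here is a quick sketch of the arithmetic behind that decision (the real logic lives in storageService.save()):

CHUNK_SIZE = 16 * 1024 * 1024       # 16MB, the BSON document limit

small_file = 5 * 1024 * 1024        # 5MB:  5 / 16 < 1, fits in one document
large_file = 40 * 1024 * 1024       # 40MB: 40 / 16 > 1, exceeds one document

print(small_file / CHUNK_SIZE > 1)  # False -> save in MongoDB via GridFS
print(large_file / CHUNK_SIZE > 1)  # True  -> save in blob storage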

The fetch() function calls the internal variable DATABASE, which is bound to the MongoDBModel class in mongodb.py, to retrieve the file data for a given file_id.

To complete the handling of files from the request, create a new mongodb.py in app/ and copy the content below into it.

from pymongo import MongoClient
import os
import gridfs


class MongoDBModel:
   # internal variables:
   # MONGO_URI - MongoDB connection string
   # CLIENT - MongoDB client from PyMongo (the MongoDB Python driver)
   # GRIDFS - handles large files in MongoDB
   MONGO_URI: str
   CLIENT: MongoClient
   GRIDFS: gridfs.GridFS


   # we initialize the class by assigning a value to each variable.
   def __init__(self) -> None:
       self.MONGO_URI = os.getenv("MONGO_URI")  # MongoDB connection string
       self.CLIENT = MongoClient(self.MONGO_URI)
       self.DATABASE = self.CLIENT["gridfs-mongodb"]
       self.GRIDFS = gridfs.GridFS(self.DATABASE)


   # saveFile stores a file in the database with GridFS.
   def saveFile(self, file, filename) -> str:
       if self.GRIDFS.exists(filename):
           self.GRIDFS.delete(filename)
       file_id = self.GRIDFS.put(file, filename=filename, _id=filename)
       return file_id


   # This method is to save documents into the MongoDB database with an upsert approach.
   # upsert means if there is an existing document then replace it, otherwise insert a new one.
   def save(self, docs) -> None:
       self.DATABASE["profiles"].update_one(
           {"fullName": docs["fullName"]}, {"$set": docs}, upsert=True
       )
       return


   # This is for retrieving image data.
   def fetch(self, file_id: str):
       return self.GRIDFS.get(file_id)

The saveFile method removes the current file (if it exists) and saves the new file into the MongoDB collections; we imitate an upsert mechanism here. (Please note that when we update files in GridFS, we don't update the binary in place. Instead, we create new chunks and update the metadata, because the system cannot guarantee the final form will have the same size and number of chunks.) The save method is a simple MongoDB update query for the document data (fullName, career, file_url). Finally, the fetch method calls the GridFS library to get the file by file_id and return the file data to the client.

Before we try sending requests, we have to go back to web.py and uncomment the lines below.

from app.routers import profile
server.include_router(profile.router, prefix="/profile", tags=["profile"])

Now, you can open another tab of Postman and create POST requests like below to http://127.0.0.1:8000/profile.
Postman POST request of /profile.

Note: profilePicture is the file-type field in Postman. You can select an image file with a .jpg, .png, or .gif extension.

We should receive the response below. Notice the image_url: it doesn't work yet, but keep its value (the image URL) for now.

Postman POST response of /profile.
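
The same upload can be scripted with the requests package (again, an extra dependency not in requirements.txt; the file name and field values below are only examples). The form field names must match the parameters of update_profile.

import requests

with open("avatar.png", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/profile/",
        files={"profilePicture": ("avatar.png", f, "image/png")},
        data={"fullName": "Ada Lovelace", "career": "Mathematician", "id": "ada01"},
    )
print(resp.json())  # expect {"response": "OK", "image_url": "..."}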

Before we go further, we should understand how we store large files in MongoDB with GridFS.

Save file as binary into MongoDB Atlas by GridFS

GridFS is MongoDB's mechanism for storing file data that exceeds the maximum BSON document size (16MB). While we would rather limit file sizes beforehand so they never exceed the maximum, knowing how the underlying mechanism works is helpful for edge cases.

How GridFS actually works

Systems have long been able to store binary data directly in MongoDB documents. However, the maximum document size is a constraint for large files. Imagine we have a 10MB file.

How a binary file is stored in MongoDB

That is where GridFS comes in. It breaks large files down into smaller chunks and creates metadata that groups all the chunks for retrieval and removal operations.

GridFS background

GridFS defines two collections: fs.files, which stores metadata, and fs.chunks, which stores the chunk data.

Here's a metadata example in the fs.files collection. This schema describes the file. The chunkSize field records the size of each part; the parts themselves live in the fs.chunks collection.

fs.files collection
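
For illustration, a metadata document in fs.files looks roughly like the sketch below (the values are made up; in our app, saveFile sets _id to the filename).

# an illustrative fs.files metadata document (values are made up)
{
    "_id": "ada01.png",       # we set _id to the filename in saveFile
    "length": 5242880,        # total file size in bytes (5MB here)
    "chunkSize": 261120,      # default GridFS chunk size (255KB)
    "uploadDate": datetime.datetime(2024, 11, 20, 10, 0, 0),
    "filename": "ada01.png",
}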

Here's an example of a chunk in the fs.chunks collection: the real file content as binary data. The n field is the order of the chunk within the file as divided by GridFS.

fs.chunks collection
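
And an illustrative fs.chunks document might look like the sketch below (values are made up). One file maps to many of these documents, ordered by n.

# an illustrative fs.chunks document (values are made up)
{
    "_id": ObjectId("..."),   # unique id of this chunk
    "files_id": "ada01.png",  # points back to the fs.files document's _id
    "n": 0,                   # chunk order within the file (0, 1, 2, ...)
    "data": b"<up to chunkSize bytes of binary data>",
}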

Now, we can store files in MongoDB Atlas (or Blob Storage). It’s time to see how we can retrieve data for utilization.

Implement rendering router for uploaded image

Create `image.py` in `app/routers` and copy the code below.
# StreamingResponse handles streaming image data back in the response.
# mimetypes is here to determine a file's type from its name.
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from app.storageService import storageService
import os
import mimetypes


router = APIRouter()


# UPLOAD_DIR is an upload directory to store Blob object files.
UPLOAD_DIR = "uploads"




# Below is how a file handled by GridFS and stored in MongoDB
# is returned to the user via
# http://127.0.0.1:8000/image/fs/{id of file}.{extension}
@router.get("/fs/{file_id}")
async def display_image_fs(file_id: str):
   try:
       # retrieve files from MongoDB.
       file = storageService.fetch(file_id)


       # stream image back to response.
       def file_stream():
           while chunk := file.read(1024 * 1024):
               yield chunk


       # return image response.
       return StreamingResponse(
           file_stream(),
           media_type=file.content_type,
           headers={"Content-Disposition": f"attachment; filename={file.filename}"},
       )
   except Exception as e:
       raise HTTPException(status_code=500, detail=str(e))


# To let the reader focus on GridFS and MongoDB binary data,
# we abstract blob storage as files written to the server's disk.
# Below, Python file IO reads the file from disk
# and returns it to the user via
# http://127.0.0.1:8000/image/blob/{id of file}.{extension}
@router.get("/blob/{file_id}")
async def display_image_blob(file_id: str):
   try:
       # we receive filename from user.
       # check if that filename exists on disk.
       filename = f"{UPLOAD_DIR}/{file_id}"
       if os.path.exists(filename):


           def file_stream():
               with open(filename, "rb") as file:
                   yield from file


           # return file data as Streaming.
           return StreamingResponse(
               file_stream(),
               media_type=mimetypes.guess_type(filename)[0],
               headers={"Content-Disposition": f"attachment; filename={filename}"},
           )
       else:
           # if the file is not found, raise a 404 error.
           raise HTTPException(status_code=404, detail="Can't find image")
   except Exception as e:
       raise HTTPException(status_code=500, detail=str(e))

There are two functions in this file.
display_image_fs fetches data from MongoDB Atlas. This function is responsible for http://127.0.0.1:8000/image/fs/{id of file}.{extension}. Its logic is to receive a file_id and call storageService.fetch, which triggers GridFS to retrieve the image data from MongoDB Atlas.

display_image_blob fetches data from blob storage (a local disk, in our case). This function is responsible for http://127.0.0.1:8000/image/blob/{id of file}.{extension}. The implementation, similarly, takes the file_id and fetches the file data from blob storage; in our case, it reads the data directly from the local disk in the uploads directory.

We can uncomment the lines of code below in web.py to enable image rendering.

from app.routers import image
server.include_router(image.router, prefix="/image", tags=["image"])

Then, you can copy the image_url from the previous section and paste it into a Postman GET request. The program should render the image data like below.

image display from MongoDB Atlas

Now, all of the functionality is set. We can store image data in MongoDB, retrieve it, and respond to the client's request.

Summary

Storing a file in MongoDB with GridFS has its own use cases and advantages. For example, storing an avatar alongside the content it relates to gives us a single source of truth for data retrieval and reduces file-access latency compared to blob storage that lives outside the same environment, on another machine. Moreover, MongoDB can optimize data access with built-in features such as caching and indexing.

We've created a complete web-server that receives image and profile uploads from users. Then, we built the logic to select between MongoDB and Blob Storage. We utilized GridFS to manage chunking for binary files. Lastly, we developed the logic to return the file data over the network to the user's web browser.

In some cases, however, excessively large files are better handled with a combination of MongoDB and Blob Storage. We trade latency for reduced MongoDB overhead, such as indexing operations. This approach can reduce cost in exchange for lower speed when retrieving file data from the external source.
