Alexis MP for Google Cloud

Your private PDF merge service

This post walks you through packaging an efficient Linux command for merging PDF files into a web app and hosting it on Cloud Run. You can then make it available to your friends and family, or within your enterprise, either through a browser or as an internal API to merge PDFs. The full code is in this GitHub repo and a live service is here.

The magic Linux command to merge PDF files

There’s a pretty good discussion on Stack Overflow comparing different approaches and results (speed, document size, etc.). I ended up using pdfunite, but feel free to use something else here, as swapping it out would probably be a one-line change.

Creating a (web) app from a CLI tool

The first step to making this available as a web app is to be able to invoke the command-line tool from your language of choice, in my case Java. I used ProcessBuilder, calling inheritIO() so the child process shares the app's standard streams, and then Process.waitFor() to block until the command completes.

This obviously requires that pdfunite is installed on the underlying OS, but more on that as we package the app in a container.
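As a minimal sketch of that invocation (not the code from the repo): the class and method names below are hypothetical, and the only assumptions about pdfunite are that it is on the PATH and takes the input files followed by the output file as arguments.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative, not taken from the repo.
class PdfUniteRunner {

    // Builds the argument list: pdfunite in1.pdf in2.pdf ... out.pdf
    static List<String> buildCommand(List<Path> inputs, Path output) {
        List<String> command = new ArrayList<>();
        command.add("pdfunite");
        for (Path input : inputs) {
            command.add(input.toString());
        }
        command.add(output.toString());
        return command;
    }

    // Runs pdfunite and returns its exit code (0 means success).
    // Assumes the pdfunite binary is installed and on the PATH.
    static int merge(List<Path> inputs, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputs, output))
                .inheritIO()   // share stdout/stderr with the JVM
                .start();
        return process.waitFor(); // block until the command finishes
    }
}
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems with file names that contain spaces.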

Popular Java web frameworks include Spring Boot, Micronaut, and Spark, and raw HTTP Servlets are an option too. I went for Spring Boot and used a single controller to manage the MultipartFile upload. All PDF file storage and manipulation is delegated to a dedicated service, which removes from the filesystem every file it uses or generates.

The front-end is dead simple; it supports four file uploads and preserves the order of files being merged (the limit of four was chosen arbitrarily; the back-end imposes no such limit). There’s huge room for improvement as discussed in this README.

Note here that you really don’t have to use Java - since we’ll be packaging this as a container in the next step we really could have used any combination of language and framework here.

Packaging this in a Container (without a Dockerfile)

One nice thing about Java is its vibrant ecosystem, and here this means using Jib, an open source tool to build and push an optimized container image without having to write a Dockerfile or even install Docker. I used the Maven plugin and executed mvn compile jib:build (Gradle is also supported).

The key thing here, though, is to make sure the containerized app can actually call the pdfunite binary. Jib has no way to install Linux packages, so I replaced the default Jib base image in my pom.xml with a simple image based on openjdk:11-jre-slim to which I added the pdfunite binary (apt-get install poppler-utils). You do not actually need to build this base image and can instead use the one I made available.

Once your app is packaged into a container, the image is pushed (as part of Jib’s build process) to Google Container Registry for easier and faster deployments, with this tag: gcr.io/PROJECT-ID/pdfmerger

Meet Cloud Run

Cloud Run is Google’s fully-managed solution that automatically scales stateless containers with a pay-per-use model. It is remarkably simple to use and a good fit to expose this small Java web application to the world or within your organization.

With our image already available from Container Registry, we can deploy a new service with gcloud:

$ gcloud run deploy --image gcr.io/PROJECT-ID/pdfmerger

You can provide options to specify the service name, the use of the managed platform (vs. hosting on a Kubernetes cluster), and the deployment region. You can allow unauthenticated requests if you’d like to make this accessible as a public website, or restrict access via Cloud IAM (if this is meant to be a service accessible to other apps in your organization). Here are the relevant options when using console.cloud.google.com:

(Screenshots of the deployment options in the console.)

Because Cloud Run is built on Knative, you can also deploy the same app on Anthos GKE with Cloud Run for Anthos, a great option for some corporate settings.

Once deployment completes (which should be fairly quick), Cloud Run gives you a URL:

(Screenshots of the deployed service.)

The 40 MB size limit seemed reasonable and is enforced in Spring Boot’s application.properties. I’ve also bumped up the memory setting for Cloud Run to 2 GB.
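For illustration, a size limit like this can be expressed with Spring Boot's standard multipart properties. This is a sketch under assumptions: the post does not show the exact keys used in the repo, nor whether the 40 MB applies per file or per request, so both are set here.

```properties
# Assumed configuration; the repo's actual application.properties may differ.
# Maximum size of a single uploaded file
spring.servlet.multipart.max-file-size=40MB
# Maximum size of the whole multipart request (all files combined)
spring.servlet.multipart.max-request-size=40MB
```

These keys apply to Spring Boot 2.x and later; requests over the limit are rejected before they reach the controller.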

A word on concurrency

Cloud Run has concurrency built in, which means that multiple requests can hit the same container instance. This is a great way to better utilize resources and also to limit cold starts. It does mean, however, that you need to build your service with concurrency in mind and rely on APIs and services that are safe to use concurrently.

In our case, this means that concurrent users (all those using a given container instance) share the same filesystem, and it is important to keep that in mind to avoid mixing users' files in any way. For this app, I addressed this by grouping files by user with a common prefix and preserving their order in a List.
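The prefix-and-List idea can be sketched as follows. This is an illustration rather than the repo's implementation: the RequestFiles class, its method names, and the use of a random UUID as the prefix are all assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical per-request file tracker; create one instance per merge request.
class RequestFiles {
    // Assumption: a random UUID serves as the common prefix for one user's files.
    private final String prefix = UUID.randomUUID().toString();
    private final Path workDir;
    // Each instance is used by a single request, so a plain ArrayList suffices.
    private final List<Path> inputs = new ArrayList<>();

    RequestFiles(Path workDir) {
        this.workDir = workDir;
    }

    // Reserves the path for the next uploaded file; list order is upload order.
    Path nextInput() {
        Path path = workDir.resolve(prefix + "-" + inputs.size() + ".pdf");
        inputs.add(path);
        return path;
    }

    // Path of the merged result, sharing the same prefix.
    Path output() {
        return workDir.resolve(prefix + "-merged.pdf");
    }

    // Read-only view of the inputs in the order they were added.
    List<Path> inputsInOrder() {
        return Collections.unmodifiableList(inputs);
    }

    // Deletes every file this request used or generated.
    void cleanup() {
        try {
            for (Path input : inputs) {
                Files.deleteIfExists(input);
            }
            Files.deleteIfExists(output());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because two requests never share a prefix, they can write to the same directory on the same container instance without overwriting each other's files, and calling cleanup() in a finally block keeps the shared filesystem from filling up.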

Just a starting point

You can quickly deploy your own copy of this web app using the Cloud Run button (linked from the repo's README). This encapsulates all the steps discussed so far and presents you with the URL of the deployed service.

As it stands, this web app still has a good number of limitations and I wouldn’t recommend sharing it broadly as-is. Hopefully it has nonetheless shown you one way to make a popular command-line tool available via a serverless web service.

You could also decide to take advantage of Google Cloud Storage or Google Drive integration for a better user experience, while of course sharing with your users how you are managing their files.

All of the code discussed is available in the GitHub repo and a live instance of the app is running here.

The sky's the limit

It’s not just about PDF! You can adapt this to use ImageMagick (to transform images), ffmpeg (e.g. to trim videos), Inkscape, or any other OS program that can be containerized. If you’re looking to generate a PDF from productivity suite formats, consider this similar example, which uses LibreOffice.