Have you ever had to create a multi-page PDF document from individual files?
This post walks you through packaging an efficient Linux command for merging PDF files into a web app and hosting it on Cloud Run. You can then make this privacy-preserving service available to your friends and family, or even within your enterprise, via a simple browser. Full code is in this GitHub repo and a live service is here.
There’s a pretty good discussion on StackOverflow comparing different approaches and results (speed, document size, etc.). I ended up using pdfunite, but feel free to use something else; the change would probably be a one-liner.
The first step to making this available as a web app is to be able to invoke the command-line tool from your language of choice, in my case Java. I used ProcessBuilder to launch the pdfunite command. This obviously requires that pdfunite is installed on the underlying OS, but more on that when we package the app in a container.
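As a rough illustration of that call, here is a minimal sketch rather than the code from the repo. The class and method names (PdfUnite, buildCommand, merge) are made up for this example; the only thing assumed about pdfunite is its usual argument order, input files first and the output file last.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the class used in the repo.
class PdfUnite {

    // pdfunite takes the input files in merge order, then the output file last.
    static List<String> buildCommand(List<Path> inputs, Path output) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("at least one input PDF is required");
        }
        List<String> command = new ArrayList<>();
        command.add("pdfunite");
        for (Path input : inputs) {
            command.add(input.toString());
        }
        command.add(output.toString());
        return command;
    }

    // Starts pdfunite and waits for it; an exit code of 0 means success.
    static int merge(List<Path> inputs, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(inputs, output))
                .inheritIO() // send the tool's output to the app's own console
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list (instead of one concatenated string) means file names with spaces need no shell quoting. Swapping pdfunite for another tool would only touch buildCommand.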
Popular Java web frameworks include Spring Boot, Micronaut, and Spark, and raw HTTP Servlets are an option too. I went with Spring Boot and used a single controller to manage the MultipartFile upload, delegating all PDF file storage and manipulation to a dedicated service that removes from the filesystem every file it uses or generates.
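One way to get that "remove every file it uses or generates" guarantee is to track files as they are created and delete them all in one place. The sketch below is an assumption about how such a service could be structured, not the repo's implementation; MergeWorkspace and its methods are hypothetical names, and it uses only the JDK so it stays independent of Spring.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: remembers every file it creates and deletes them on close().
class MergeWorkspace implements AutoCloseable {
    private final List<Path> tracked = new ArrayList<>();

    // Copies an upload into a new tracked temp file.
    Path store(InputStream upload) throws IOException {
        Path file = newFile();
        // createTempFile already created the file, so overwrite it.
        Files.copy(upload, file, StandardCopyOption.REPLACE_EXISTING);
        return file;
    }

    // Reserves a tracked temp file, e.g. for the merged output.
    Path newFile() throws IOException {
        Path file = Files.createTempFile("pdfmerge-", ".pdf");
        tracked.add(file);
        return file;
    }

    // Deletes every tracked file, whether the merge succeeded or not.
    @Override
    public void close() throws IOException {
        for (Path file : tracked) {
            Files.deleteIfExists(file);
        }
        tracked.clear();
    }
}
```

Used in a try-with-resources block, the cleanup runs even when the merge throws. In a Spring controller, the stream passed to store would typically come from MultipartFile.getInputStream(), and the merged bytes would need to be read into the response before the block exits.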
The front-end is dead simple; it supports four file uploads and preserves the order of files being merged (the limit of four was chosen arbitrarily; the back-end imposes no such limit). There’s huge room for improvement as discussed in this README.
Note that you really don’t have to use Java or even Spring Boot: since we’ll be packaging this as a container in the next step, any combination of language and framework would have worked here.
One nice thing about Java is its vibrant ecosystem, which in my case means using Jib, an open source tool that builds and pushes an optimized container image without having to write a Dockerfile or even install Docker. I used the Maven plugin and executed mvn compile jib:build (Gradle is also supported). This Dockerfile-less approach to building containers is available for many programming languages through Buildpacks.
The key thing here, though, is to make sure the containerized app can actually call the pdfunite binary. There is no way with Jib to add Linux packages, so I replaced the default Jib base image in my pom.xml with a simple image based on openjdk:11-jre-slim to which I added the pdfunite binary (apt-get install poppler-utils). You do not actually need to build this image and can instead use the one I made available.
Once your app is packaged into a container, the image is pushed (as part of Jib’s build process) to Google Container Registry for easier and faster deployments, under the same gcr.io/PROJECT-ID/pdfmerger name used in the deploy command below.
Cloud Run is Google’s fully-managed solution that automatically scales stateless containers with a pay-per-use model. It is remarkably simple to use and a good fit to expose this small Java web application to the world or within your organization.
With our image already available in Container Registry, we can simply deploy a new service with
$ gcloud run deploy --image gcr.io/PROJECT-ID/pdfmerger
You can provide options to specify the service name, the use of the managed platform (vs. hosting on a Kubernetes cluster), and the deployment region. You can allow unauthenticated requests if you’d like to make this accessible as a public website, or restrict access via Cloud IAM (if this is meant to be a service accessible to other apps in your organization). With gcloud, these correspond to the --platform, --region, and --allow-unauthenticated flags.
Once deployment completes (which should be fairly quick), Cloud Run gives you a URL for the new service.
The 40 MB file size limit seemed reasonable and is enforced through Spring Boot’s application.properties. I’ve also bumped up the memory setting for Cloud Run to 2 GB.
Cloud Run has concurrency built in, which means that multiple requests can hit the same container instance. This is a great way to make better use of resources and also to limit cold starts. It does mean, however, that you need to build your service with concurrency in mind and use only APIs and services that are safe under concurrent access.
In our case, this means that concurrent users (all those using a given container instance) share the same filesystem, and it is important to keep that in mind to avoid mixing users’ files in any way. For this app, I addressed this by grouping files by user with a common prefix and preserving their order in a List.
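The prefix-plus-List idea can be sketched as follows. This is an illustration under assumptions, not the repo's code: the RequestFiles class is hypothetical, and using a random UUID per merge request as the common prefix is my choice for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch: one instance per merge request, all sharing one directory.
class RequestFiles {
    private final Path sharedDir;
    // Assumption: a random UUID serves as the per-request prefix.
    private final String prefix = UUID.randomUUID().toString();
    private final List<Path> inOrder = new ArrayList<>();

    RequestFiles(Path sharedDir) {
        this.sharedDir = sharedDir;
    }

    // Writes one upload as "<prefix>-<position>.pdf" and records it in upload order.
    Path add(byte[] content) throws IOException {
        Path file = sharedDir.resolve(prefix + "-" + inOrder.size() + ".pdf");
        Files.write(file, content);
        inOrder.add(file);
        return file;
    }

    // Files in upload order, which is also the order to merge them in.
    List<Path> files() {
        return Collections.unmodifiableList(inOrder);
    }

    String prefix() {
        return prefix;
    }
}
```

Because each request holds its own instance, two users uploading at the same time to the same container never write to the same path, and the merge order comes from the List instead of from directory listing order.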
You can quickly deploy your own copy of this web app using the Cloud Run button (linked from the repo’s README). This encapsulates all the steps discussed so far and presents you with the URL of the deployed service.
As it stands, there are still a few limitations with this web app, but hopefully it has shown you one way to make a popular command-line tool available via a serverlessly-hosted web service.
It’s not just about PDF! You can adapt this to use ImageMagick (to transform images), ffmpeg (e.g. to trim videos), Inkscape, or any other OS program that can be containerized. If you’re looking to generate a PDF from productivity suite formats, consider this similar example, which uses LibreOffice.