Automatically OCR scanned PDFs in NixOS

Jasper Woudenberg

Luckily, I receive more and more letters by email these days, but I still get a fair number on paper as well. These I scan and then throw away.

So that I can find these documents again when I need them, I run optical character recognition (OCR) on them after scanning. I can then use pdfgrep to search for a keyword in a directory of PDFs. That's so much easier than coming up with an organization scheme for these documents and applying it!
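As a rough sketch of what such a search could look like, the snippet below wraps pdfgrep in a small shell function. The function name `search_scans` and the default directory are assumptions made for illustration; only the `-r`, `-i` and `-n` flags come from pdfgrep itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper: search every PDF under a directory for a keyword.
# Usage: search_scans KEYWORD [DIRECTORY]
search_scans() {
  local keyword="$1"
  # Assumption: /docs is the directory holding the OCR'd PDFs.
  local dir="${2:-/docs}"
  # -r  recurse into the directory
  # -i  ignore case
  # -n  prefix each match with its page number
  pdfgrep -r -i -n "$keyword" "$dir"
}
```

With this in place, something like `search_scans invoice` should print the matching lines together with the file and page they were found on.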

Here is how it works: my scanner is able to upload scanned files to a directory on a small server I'm running. When the server notices a new PDF in this directory, it runs optical character recognition on it and then moves it to a different directory containing all the PDFs I have ever scanned.

My scanner is a Brother ADS-1700w. The server is the smallest Hetzner Cloud instance (CX11) and costs me 3 euros a month. I use OCRmyPDF to run optical character recognition. The server is running NixOS and is deployed using morph. Finally, I'm using healthchecks.io to let me know when the setup breaks.

Below is the annotated Nix code that makes the whole thing work.

{ pkgs, ... }:

{
  # A systemd path unit. Path units can be used to start
  # other services when something happens on the file
  # system, like a file being created.
  systemd.paths.ocrmypdf = {
    enable = true;
    # Enable this unit automatically when the server starts.
    wantedBy = [ "multi-user.target" ];
    description = "Start ocr-ing when there's new work.";
    pathConfig = {
      # Activate when files appear in the /data/scans-to-ocr
      # directory. This is where our scanner should upload
      # scanned files!
      DirectoryNotEmpty = "/data/scans-to-ocr";
      # If /data/scans-to-ocr does not exist, create it.
      MakeDirectory = true;
    };
  };

  # The service that does the actual work of running OCR.
  systemd.services.ocrmypdf = {
    enable = true;
    description = "Run ocrmypdf in /data/scans-to-ocr.";
    serviceConfig = {
      # Explain to systemd that this service is a script,
      # not some long-running process it needs to keep
      # alive. If the script exits after it's done that's
      # fine, systemd will call it again if there's new
      # work!
      Type = "oneshot";
      # Now we define what to run when this service gets
      # activated.
      #
      # We can pass `ExecStart` a single command to execute,
      # but the work we want to do does not fit in a single
      # command. Instead we let Nix create a shell script,
      # then tell systemd to run that script.
      ExecStart = let
        script = pkgs.writeShellScriptBin "go-ocr" ''
          # Run our OCR logic in turn for each scanned file.
          for file in /data/scans-to-ocr/*; do
            # Generate an output file name containing the
            # current date and some random characters.
            # `mktemp -u` only prints the name, it does not
            # create the file.
            output="$(mktemp -u "/tmp/$(date +%Y%m%d)_XXX.pdf")"

            # Run ocrmypdf on the scanned file.
            # --output-type   produce a regular PDF instead
            #                 of the default PDF/A. PDF/A
            #                 conversion might fail,
            #                 requiring manual intervention.
            # --rotate-pages  puts pages right-side-up
            # --skip-text     makes it so ocrmypdf skips
            #                 pages in the PDF that already
            #                 have text content, instead of
            #                 failing.
            # --language      which languages OCR should
            #                 detect (here Dutch and
            #                 English). Many (all?)
            #                 languages seem to be available
            #                 by default. Run
            #                 `tesseract --list-langs`
            #                 to find out which.
            ${pkgs.ocrmypdf}/bin/ocrmypdf \
              --output-type pdf \
              --rotate-pages \
              --skip-text \
              --language nld+eng \
              "$file" \
              "$output" \
              && rm "$file" \
              && mv "$output" /docs

            # Let healthchecks.io know whether ocrmypdf
            # succeeded or failed. `$?` is the exit status
            # of the command chain above; 0 signals success.
            ${pkgs.curl}/bin/curl --retry 3 \
              https://hc-ping.com/YOUR_UUID/$?
          done
        '';
      in "${script}/bin/go-ocr";
    };
  };

  # For healthchecks.io to mark a check as healthy it needs
  # to receive a periodic update, but we might not scan any
  # documents for days on end. The cron job below pings
  # healthchecks.io once an hour, but only if the
  # /data/scans-to-ocr directory is empty, indicating
  # ocrmypdf has processed everything it was given.
  services.cron.enable = true;
  services.cron.systemCronJobs = [
    ("0 * * * *      root    "
      + "ls -1qA /data/scans-to-ocr/ | grep -q . "
      + "|| curl -fsS -m 10 --retry 5 "
      + "-o /dev/null https://hc-ping.com/YOUR_UUID")
  ];

}
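
The `ls -1qA ... | grep -q .` test in the cron job is compact, so here is a minimal sketch of how it behaves in isolation. The function name `dir_has_entries` and the throwaway directory are made up for this illustration and are not part of the configuration above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the given directory contains at
# least one entry, hidden files included.
dir_has_entries() {
  # -1  one entry per line
  # -q  replace non-printable characters in names with '?'
  # -A  include hidden entries, but not '.' and '..'
  # `grep -q .` succeeds as soon as it reads a non-empty line.
  ls -1qA "$1" | grep -q .
}

# Try it on a throwaway directory.
demo_dir="$(mktemp -d)"
dir_has_entries "$demo_dir" || echo "empty: the cron job would ping"
touch "$demo_dir/scan.pdf"
dir_has_entries "$demo_dir" && echo "backlog: the cron job would stay silent"
rm -r "$demo_dir"
```

One consequence of this approach: if the directory does not exist, `ls` prints nothing to the pipe, so the check treats it as empty and the ping would still be sent.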