DEV Community

MartinJ

Searching PDF files - GCSE and "local-code" options

Last reviewed: March '26

1. Introduction

If you are responsible for a large collection of PDF files, few things are more frustrating than knowing that, somewhere in your archive, there is a document that contains exactly the information you are looking for, but there's no easy way of finding it. If you were looking for a message in your email folders, a service such as Gmail's folder search would provide an answer in seconds. How can you provide a similar arrangement for PDF files? Here are a few ideas.

2. GCSE - Google's Custom Search Engine

If your PDF files are already on the web, the GCSE makes the task of adding a search facility to a webapp a five-minute job.

The first step is to upload your PDF files to a cloud location - typically a Google storage bucket. Then you use the Programmable Search Engine Homepage to construct a customised entry for your location in Google's GCSE system - see the "Create a search engine" section of Google's Getting started with Programmable Search Engine support page.

All you need here is a Google account. The instructions referenced above will then enable you to use the Google Programmable Search Engine Console to register a search engine for your site. In return, you receive a unique Search Engine Id.

To use your GCSE in a conventional JavaScript webapp, all you need is the following tiny packet of JavaScript:

<script async src="https://cse.google.com/cse.js?cx=YOUR_SEARCH_ENGINE_ID"></script>
<div class="gcse-search"></div>

The effect of this will be to display a text-search input field and a search icon button.

Submitting a search specification will typically present the results in a pop-up window (depending on your chosen GCSE layout).

This arrangement worked flawlessly for me in my first simple website - the only time I ever encountered a problem was when <td> and <th> styles in my webapp's stylesheets collided with Google's use of these elements in its script. This was easily fixed by qualifying my styles with a classname.

But I was initially stumped when I wanted to use a GCSE in a React webapp.

Eventually I hit on a sandbox registered by khrismuc at React cascading select.

This uses a React useEffect hook to reach into the DOM and invoke Google's cse script. All I had to do in my own case was to create a component as follows:

import React, { useEffect } from "react";

function MyGcseSearch() {
    useEffect(() => {
        // Inject Google's cse loader script once, on mount
        const script = document.createElement("script");
        script.src = "https://cse.google.com/cse.js?cx=00111 ..obscured ...ihvlxu";
        script.async = true;
        document.head.append(script);
        // Remove the script tag again if the component unmounts
        return () => script.remove();
    }, []);

    return (
            <div className="gcse-search"></div>
    );
}

export { MyGcseSearch };

Then all I had to do was import the component into my webapp and render it as <MyGcseSearch/>.

3. A "local-code" solution

The GCSE is a fine facility, but it may not work well for everybody. The biggest obstacle is that it requires your files to be indexed by Google. In my experience, Google can be very reluctant to index PDF files. Secondly, the cost of providing this service must be considerable, and Google recovers these costs by including adverts alongside search results. Although a free "no-ads" version is available for non-profit organisations, qualifying as such may prove difficult. In my experience, the ads are intrusive, and the presence of associated cookies would likely have required my site to display a GDPR pop-up. Frustrated, I started to look for other solutions.

Pulling text from a PDF file

A couple of fascinating mornings spent with ChatGPT threw up a useful set of ideas.

The best suggestion was that I should look at pdftotext, a command-line tool that converts PDF files to plain text files with a command such as:

pdftotext document.pdf output.txt

The pdftotext command comes pre-installed on many Linux systems (it is part of the Poppler utilities), but I'm a Windows developer focused on PowerShell. This meant that I had to download a Windows build of Poppler as a zip file from https://github.com/oschwartz10612/poppler-windows/releases.

The zip file was then extracted to C:\Tools\poppler, and I added the following to my system PATH:

C:\Tools\poppler\Library\bin

I could now use pdftotext in PowerShell, and was soon looking at a text version of one of my PDF files. The tool was lightning fast!

Caching the pulled text

It quickly became clear that the recovered text would need to be held in some sort of permanent cache, readily accessible to the webapp's search routines. The obvious solution was to read all the files in my PDF archive, use pdftotext to convert each one into a text string, and then combine these into a single massive JSON file. An online text search would filter the JSON on the supplied keywords and display the results.

Here are a few lines from the start of one of my JSON files (since I had several different archives, it made sense to give each one its own JSON):

[
{"filename":"1998-2.pdf","text":"[...] Editorial [...]  CONTENTS : [...]  Appleby Archaeology Group came ?…? order to ta ............. "}
,
{"filename":"1998-3.pdf","text":"[...]  Appleby ?…? Archaeology [...]  UPDATE ?…? Summer 98 ?…? this pleases more memb ............" }
,

The structure of these entries will make more sense when you've seen the sort of output I was aiming to generate. Here's the output from the search routine in my apparch.org website at Archives/Newsletters.


This screenshot shows the results of searching Appleby Archaeology's newsletter archive for the keyword "Mayburgh". Each line displays a link to a newsletter that contains at least one reference to the keyword. The line mainly consists of the text immediately surrounding the first instance of the keyword in the newsletter. Formatting and other strange characters in the original text are replaced by standardised "blob" sequences represented as [[...]]. The first part of each line displays a clickable link generated from the "filename" property of the JSON line for the newsletter, and the total number of matches is displayed at the end of the line. The total set of matched newsletters is sorted in match-count sequence so that strongly-matched newsletters are given prominence.
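The behaviour described above can be sketched in plain JavaScript. This is an illustrative reimplementation rather than the site's actual code, but the index shape matches the JSON sample shown earlier:

```javascript
// Sketch of a keyword search over the cached index.
// Each entry is { filename, text }, as in the JSON sample above.
function searchIndex(index, keyword, contextChars = 40) {
    const needle = keyword.toLowerCase();
    return index
        .map(({ filename, text }) => {
            const haystack = text.toLowerCase();
            // Count every occurrence of the keyword
            let count = 0;
            let pos = haystack.indexOf(needle);
            const first = pos;
            while (pos !== -1) {
                count += 1;
                pos = haystack.indexOf(needle, pos + needle.length);
            }
            if (count === 0) return null;
            // Snippet of text surrounding the first match
            const start = Math.max(0, first - contextChars);
            const end = Math.min(text.length, first + needle.length + contextChars);
            return { filename, snippet: text.slice(start, end), count };
        })
        .filter(Boolean)
        // Strongly-matched files first
        .sort((a, b) => b.count - a.count);
}
```

The sort at the end is what gives strongly-matched newsletters their prominence in the results list.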

Developing a script to maintain the cache

Once all this had been decided, the detailed design of a PowerShell script that would generate a JSON became clear:

  1. It would be parameterised so that the JSONs for all my archive classes could be served by a single PowerShell script that accepted a SourceFolder parameter.
  2. The content of the supplied SourceFolder (in my case, a Google storage bucket) would be downloaded (using gsutil in my case) into a local PdfRoot project folder. For efficiency, remote files should only be downloaded when they are more recent than the current local copy.
  3. pdftotext would then be applied to each file in PdfRoot to create a corresponding PdfText folder.
  4. Finally, a JSON would be constructed containing a line for each file in PdfText.
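The final step of that pipeline can be sketched as follows. The actual script is PowerShell, so this JavaScript `buildIndex` helper is purely illustrative; it just shows the transformation from extracted text to the one-entry-per-file JSON index:

```javascript
// Hypothetical sketch of step 4: turn a set of extracted text files
// into the one-entry-per-file JSON index used by the search routine.
// `files` maps each PDF filename to the text that pdftotext produced for it.
function buildIndex(files) {
    const entries = Object.entries(files).map(([filename, text]) => ({
        filename,
        // Collapse runs of whitespace so the cached text stays compact
        text: text.replace(/\s+/g, " ").trim(),
    }));
    return JSON.stringify(entries);
}
```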

The first issue to consider was where the JSONs would be held and how they would be read. My project uses React and is deployed to Google's Firebase, so it seemed obvious that the JSONs should be stored in React's public folder. Here they can be read by a simple fetch call:

   async function ensureIndexLoaded() {
        if (indexRef.current) return indexRef.current; // don't download anything if you've already done this

        // firebase.json sets a max-age header on the downloaded_texts.json file of 3 months so once
        // downloaded you can use it for 3 months from cache without refresh. Note that for a first-time user
        // of the webapp, the impact of the extra load will /only/ be felt when the user actually runs a
        // newsletter search and the source file is explicitly requested.
        const res = await fetch("/search_jsons/" + archiveTarget + ".json");
        if (!res.ok) throw new Error(`Failed to load index: ${res.status}`);
        indexRef.current = await res.json();
        return indexRef.current;
    }

The beauty of using the public folder to hold the JSON source file is that the fetch will only be executed when it is actually referenced by a user electing to search. Even then, the request is likely to be served from the browser cache.
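For reference, the long cache lifetime mentioned in the code comments above is configured through a Cache-Control header in firebase.json. A minimal hosting fragment might look like this (the exact glob and the 90-day max-age value here are assumptions, not the site's actual config):

```json
{
  "hosting": {
    "headers": [
      {
        "source": "/search_jsons/**",
        "headers": [
          { "key": "Cache-Control", "value": "public, max-age=7776000" }
        ]
      }
    ]
  }
}
```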

From here, it was but a few short hops to creating a universal search button component in my React webapp.

Up to this point, I had no real confidence that JavaScript searching a large block of browser memory for a keyword would deliver acceptable performance. Judged against industry standards, my archives are modest. For example, AppArch's newsletters archive contains only 100 or so newsletters - enough to make a manual search impractical, but only occupying 2MB when represented as a JSON. Still, it was a great relief when I found that search responses were virtually instantaneous.

Dealing with "scanned" PDF files

Once things settled down, it became obvious that there was a snag. Most of AppArch's PDF files had been generated from word-processor source. These seemed generally fine. But a few had been generated by scanning old paper copies. Searches for content here yielded a null response: pdftotext was unable to "see" text in the scanned image.

The answer to this was to enhance the JSON-generation script to recognise files with no "meaningful" content and fall back to OCR (Optical Character Recognition) technologies.
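The "no meaningful content" test can be as simple as a heuristic on the extracted text. Here's an illustrative sketch; the function name and both thresholds are assumptions, not the script's actual rule:

```javascript
// Heuristic sketch: decide whether pdftotext produced usable text,
// or whether the file should fall back to OCR instead.
// The thresholds are illustrative assumptions, not the article's actual rule.
function needsOcr(extractedText, minChars = 50, minAlphaRatio = 0.5) {
    const trimmed = extractedText.trim();
    if (trimmed.length < minChars) return true; // little or no text found
    const alpha = (trimmed.match(/[a-z]/gi) || []).length;
    return alpha / trimmed.length < minAlphaRatio; // mostly junk characters
}
```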

With Python installed on my PC, I was eventually able to extract text from PDF files requiring OCR conversion with a command such as:

ocrmypdf --sidecar output.txt document.pdf ocr_document.pdf

(ocrmypdf's main output is a searchable PDF; it is the --sidecar option that delivers the extracted text file.)

But getting to that point was tricky. First of all, I had to install Mannheim Tesseract OCR from https://github.com/UB-Mannheim/tesseract/wiki and then Ghostscript from https://www.ghostscript.com/releases/gsdnld.html. Only then was I able to install ocrmypdf with a pip install ocrmypdf command.

With all this in place, I was able to enhance the JSON-creation script to deal with my scanned files. Obviously, this now ran much more slowly, but I wasn't concerned as the script was now configured to run overnight under the Windows scheduler.

Sample code for a "local-code" solution

The code for my "Generalised Archive Search Box" component and the Search-Json PowerShell script that services it can be viewed on GitHub at pdftest repo.

I hope you have found all of this useful!

Top comments (2)

Ellis

Is it really ok to use document.head.append in a React component?

MartinJ

Thanks, Ellis. I think you're right to raise your concern. I can see now how problems might arise here. Helmet might provide a better solution.