Programmatically Search and Highlight Text in PDFs using C# in .NET

#webdev #devops #dotnet #csharp

What You Will Need

Visual Studio Code
.NET 8
NuGet package: DS.Documents.Pdf 7.0.3

Tutorial Concept
This tutorial discusses programmatically conducting text searches and highlighting found text in PDFs using a C#/.NET PDF API.

This tutorial covers different ways to programmatically search, find, and highlight text within PDF documents using a .NET/C# API. We will go over loading a PDF, conducting text searches, and creating highlight markups with different colors and shapes. In this example, we will use Document Solutions for PDF (DsPdf, formerly GcPdf), a PDF API for C#/.NET developers. The generated PDFs are shown using the included JavaScript Document Solutions PDF Viewer.

This blog will cover how to conduct the following PDF text searches programmatically using a C# .NET PDF API:

Find and Highlight Text in a PDF Document
Search for Text on a Specific PDF Page
Find and Highlight Text From a Specific Range of PDF Pages
Search for Text in a PDF Based on Structure Tags
Find and Markup Graphically Transformed Text in PDFs

To Follow Along, Download a Sample App for this Tutorial Here.

Find and Highlight Text in a PDF Document Using C#

DsPdf simplifies programmatic text searches in PDF documents through its FindText method, which locates all instances of specific text. Each found item can then be highlighted by drawing on the page's Graphics object, using the bounds of the found text and a System.Drawing color. Search behavior is set through the FindTextParams constructor, with options such as wholeWord and matchCase, which determine whether the search should match whole words only, be case-sensitive, or both.

Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.

The following code will search for the whole word "wetlands" in a PDF and then highlight the found text:

    // Initialize the DsPdf document instance
    var doc = new GcPdfDocument();

    using (var fs = new FileStream(Path.Combine("wetlands.pdf"),FileMode.Open, FileAccess.Read))
    {
       // Load a sample PDF  
       doc.Load(fs);
       // Use the FindText method to search for "wetlands", using a case-insensitive, whole-word match  
       var findsWetlands = doc.FindText(new FindTextParams("wetlands", true, false), OutputRange.All);

       // Highlight all found text using semi-transparent orange red  
       foreach (var find in findsWetlands)
           doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));

       doc.Save("1 - Search and Highlight Text.pdf");
    }
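
To see how the two flags change the result, the sketch below runs the same search with each combination of wholeWord and matchCase and prints the number of matches. It relies only on the calls shown above. The class and method names (FlagComparison, CountMatches, Run) are hypothetical, and the using directive for GrapeCity.Documents.Pdf.TextMap is an assumption about where FindTextParams lives, so adjust it if your project resolves the type differently.

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Common;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.TextMap; // assumption: namespace of FindTextParams

static class FlagComparison
{
    // Hypothetical helper: counts the matches for one combination of flags.
    public static int CountMatches(GcPdfDocument doc, string text, bool wholeWord, bool matchCase)
    {
        int count = 0;
        foreach (var find in doc.FindText(new FindTextParams(text, wholeWord, matchCase), OutputRange.All))
            count++;
        return count;
    }

    // Loads the PDF at 'path' and prints one line per flag combination.
    public static void Run(string path)
    {
        var doc = new GcPdfDocument();
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        doc.Load(fs);

        foreach (var wholeWord in new[] { false, true })
            foreach (var matchCase in new[] { false, true })
                Console.WriteLine($"wholeWord={wholeWord}, matchCase={matchCase}: " +
                    $"{CountMatches(doc, "wetlands", wholeWord, matchCase)} match(es)");
    }
}
```

Calling FlagComparison.Run("wetlands.pdf") with the sample file from this tutorial prints four lines. The stricter combinations should never report more matches than the looser ones, which makes this a quick sanity check before you add any highlighting.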

Developers can do a multitude of searches and apply different types of markups. See our online documentation and demo explorer to learn more.

Search for Text on a Specific PDF Page Using C#

In some scenarios, users might want to narrow a text search down to a particular page rather than scanning the entire PDF document. This can be achieved by getting the text map of a specific page using its index and conducting the text search exclusively within that page's text map. The steps are: create a FindTextParams instance, get the page's text map, and perform the search within the text map using its FindText method.

The following code demonstrates this by searching and highlighting the word “the” on the 2nd page of the PDF document.

            // Create new instance of PDF document  
            GcPdfDocument doc = new GcPdfDocument();
            using (var fs = new FileStream(Path.Combine("wetlands.pdf"), FileMode.Open, FileAccess.Read))
            {
               // Load existing PDF  
                doc.Load(fs);
                // 1. Create a new instance of FindTextParams
                var ftp = new FindTextParams("the", true, false);
                // 2. Get the text map of a page by its index; note that the index starts at 0, so this searches page 2  
                var tm = doc.Pages[1].GetTextMap();
                if (tm != null)
                    // 3. Perform a text search within the text map using the FindText method and highlight the text in orange
                    tm.FindText(ftp, (p_) => {
                        doc.Pages[1].Graphics.FillPolygon(p_.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
                    });
                doc.Save("2 - Search Text Only Page 2.pdf");
            }
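
The same steps can be wrapped in a small helper that takes a page number instead of an index. The following is a minimal sketch: PageSearch and HighlightOnPage are hypothetical names, the using directive for GrapeCity.Documents.Pdf.TextMap is an assumption, and everything else reuses the calls from the snippet above. It also guards against a page number that does not exist and fills every entry in Bounds rather than only the first.

```csharp
using System;
using System.Drawing;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.TextMap; // assumption: namespace of FindTextParams

static class PageSearch
{
    // Hypothetical helper: highlights matches on one page, given a 1-based page number.
    // Returns the number of matches, or -1 if the page does not exist or has no text map.
    public static int HighlightOnPage(GcPdfDocument doc, int pageNumber, FindTextParams ftp, Color color)
    {
        int index = pageNumber - 1; // the Pages collection is zero-indexed
        if (index < 0 || index >= doc.Pages.Count)
            return -1;

        var page = doc.Pages[index];
        var tm = page.GetTextMap();
        if (tm == null)
            return -1;

        int count = 0;
        tm.FindText(ftp, found =>
        {
            // Fill every quadrilateral of the match, not only the first one
            foreach (var quad in found.Bounds)
                page.Graphics.FillPolygon(quad, color);
            count++;
        });
        return count;
    }
}
```

For example, PageSearch.HighlightOnPage(doc, 2, new FindTextParams("the", true, false), Color.FromArgb(100, Color.OrangeRed)) would reproduce the search above, with the return value telling you whether anything was found before you save.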

Find and Highlight Text From a Specific Range of PDF Pages Using C#

Searching for text within a specific page range in a PDF is useful for focused analysis. This targeted approach avoids scanning pages that are not of interest, which can improve performance, and isolates content for detailed examination. Developers can conduct this search programmatically by passing an OutputRange instance to the FindText method instead of OutputRange.All. In the code below, the OutputRange constructor takes the first and last pages of the range.

Note: To follow along with this section, you must include the GrapeCity.Documents.Common namespace.

The code below will search and highlight text only on pages 2 and 3 of the provided PDF document.

     // Initialize the DsPdf document instance
     var doc = new GcPdfDocument();
     using (var fs = new FileStream(Path.Combine("wetlands.pdf"),
           FileMode.Open, FileAccess.Read))
     {
         // Load an existing document from file stream  
         doc.Load(fs);
         // Create a new FindTextParams instance  
         var ftp = new FindTextParams("the", true, false);
         // Define the page range to search, from page 2 to page 3  
         OutputRange searchRange = new OutputRange(2, 3);
         // Find all text using case-insensitive word search within the page range  
         var findsTextThe = doc.FindText(ftp, searchRange);

         foreach (var find in findsTextThe)
             doc.Pages[find.PageIndex].Graphics.FillPolygon(find.Bounds[0], Color.FromArgb(100, Color.OrangeRed));
         doc.Save("3 - Find and Highlight Text From a Specific Range of PDF Pages.pdf");
     }
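
To confirm that a range search only touched the pages you expected, it can help to summarize the matches per page. The sketch below is one way to do that, assuming the range passed to OutputRange is expressed in page numbers (as in the example above) while each result's PageIndex is zero-based (as its use with doc.Pages suggests). RangeSummary and Print are hypothetical names, and the using directive for GrapeCity.Documents.Pdf.TextMap is an assumption.

```csharp
using System;
using System.Linq;
using GrapeCity.Documents.Common;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.TextMap; // assumption: namespace of FindTextParams

static class RangeSummary
{
    // Hypothetical helper: prints how many whole-word, case-insensitive
    // matches were found on each page of the given range.
    public static void Print(GcPdfDocument doc, string text, int fromPage, int toPage)
    {
        var finds = doc.FindText(new FindTextParams(text, true, false),
                                 new OutputRange(fromPage, toPage));

        // Group the results by page; add 1 to turn the zero-based index into a page number
        foreach (var pageGroup in finds.GroupBy(f => f.PageIndex).OrderBy(g => g.Key))
            Console.WriteLine($"Page {pageGroup.Key + 1}: {pageGroup.Count()} match(es)");
    }
}
```

With the document from this section loaded, RangeSummary.Print(doc, "the", 2, 3) should list only pages 2 and 3. Pages without a match do not appear in the output.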

Search for Text in a PDF Based on Structure Tags

Searching for text based on structure tags offers an alternative way to narrow a text search. For instance, to locate headers like H1, H2, or H3, users can call the GetLogicalStructure method to retrieve the PDF document's structure. The code then takes the root element of that structure, finds its child elements with the desired tag, such as "H1", and loops through them to highlight the ones containing the desired text.

Note: To follow along with this section, you must include the GrapeCity.Documents.Pdf.Recognition.Structure namespace.

The following code will get the PDF’s H1 tags and search through them for the text “C1Olap”.

     GcPdfDocument doc = new GcPdfDocument();
     using (var fs = new FileStream(Path.Combine("read-tags-to-outlines.pdf"), FileMode.Open, FileAccess.Read))
     {
         doc.Load(fs);
         // Get the LogicalStructure of the doc
         LogicalStructure ls = doc.GetLogicalStructure();
         if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
         {
             // No structure tags found:
             Console.WriteLine("No structure tags were found in the source document.");
             return;
         }

         // Element holds a reference of the logical structure
         Element root = ls.Elements[0];
         // Find all the H1 tags
         var find = root.Children.ToList().FindAll(e_ => e_.StructElement.Type == "H1");
         //  Loop through all found H1 tags for specific text  
         foreach (Element e in find)
         {
             var color = Color.FromArgb(64, Color.Red);
             if (e.HasContentItems)
             {
                 // Get headers text  
                 var text = e.GetText();
                 foreach (var i in e.ContentItems)
                 {
                     // Search for title with text "C1Olap"   
                     if (text.Contains("C1Olap", StringComparison.OrdinalIgnoreCase))
                     {
                         if (i is ContentItem ci)
                         {
                             var p = ci.GetParagraph();
                             if (p != null)
                             {
                                 // Draw an outline around the found H1, using the paragraph's coordinates  
                                 ci.Page.Graphics.DrawPolygon(p.GetCoords(), color, 1, null);
                             }
                         }
                     }
                 }
             }
             else
                 Console.WriteLine("No Text Found");
         }
         doc.Save("4 - Search for Text in a PDF Based on Structure Tags.pdf");
         Console.WriteLine("PDF saved");
     }
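
Before searching the tags for specific text, it can be useful to list what the headings actually contain. The sketch below prints the text of every element with a given tag, using only the members that appear in the code above (GetLogicalStructure, Elements, Children, StructElement.Type, HasContentItems, GetText). HeadingLister and PrintHeadings are hypothetical names, and like the code above it only looks at the direct children of the first root element.

```csharp
using System;
using GrapeCity.Documents.Pdf;
using GrapeCity.Documents.Pdf.Recognition.Structure;

static class HeadingLister
{
    // Hypothetical helper: prints the text of each element with the given structure tag.
    public static void PrintHeadings(GcPdfDocument doc, string tag = "H1")
    {
        LogicalStructure ls = doc.GetLogicalStructure();
        if (ls == null || ls.Elements == null || ls.Elements.Count == 0)
        {
            Console.WriteLine("No structure tags were found in the source document.");
            return;
        }

        // Walk the direct children of the first root element
        foreach (Element e in ls.Elements[0].Children)
        {
            if (e.StructElement.Type == tag && e.HasContentItems)
                Console.WriteLine($"{tag}: {e.GetText()}");
        }
    }
}
```

Running HeadingLister.PrintHeadings(doc) on the loaded sample shows which H1 texts are available, so you can pick a search string that you know is present.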

To learn more about reading PDF structure tags using C#, check out the online Read Structure Tags Demo.

Find and Markup Graphically Transformed Text in PDFs

PDFs often contain graphically transformed text, which is text drawn on top of an existing PDF using page graphics. This is typical when adding a logo or watermark to a PDF. DsPdf can search within graphically transformed text and highlight the found items.

To accomplish this, use DsPdf's FindText method to search for the wanted text.

Then, loop through each page containing the searched text and create a content stream using DsPdf's ContentStreams property. With this stream, get the graphics on the page using the GetGraphics method and apply the highlighting to the bounds of the found text from the returned graphics.

The following code searches a PDF document for the text "LOGO", which appears as graphically transformed text in a logo watermark, and then highlights each found instance with a filled blue polygon.

            // Initialize the DsPdf document instance
            var doc = new GcPdfDocument();

            using (var fs = new FileStream(Path.Combine("Transformed Text.pdf"), FileMode.Open, FileAccess.Read))
            {
                // Load an existing document from file stream  
                doc.Load(fs);
                // Find all text items 'LOGO', using case-sensitive search
                var finds = doc.FindText(new FindTextParams("LOGO", false, true), OutputRange.All);
                // Highlight all finds: first, find all pages where the text was found  
                var pgIndices = finds.Select(f_ => f_.PageIndex).Distinct();
                // Loop through pages with found text  
                foreach (int pgIdx in pgIndices)
                {
                    var page = doc.Pages[pgIdx];
                    // Insert a new content stream at the start of the page's content streams  
                    PageContentStream pcs = page.ContentStreams.Insert(0);
                    // Get the graphics for drawing on the new content stream  
                    var g = pcs.GetGraphics(page);
                    foreach (var find in finds.Where(f_ => f_.PageIndex == pgIdx))
                    {
                        foreach (var ql in find.Bounds)
                        {
                            // Fill the bounds of the found text, then outline them in blue  
                            g.FillPolygon(ql, Color.CadetBlue);
                            g.DrawPolygon(ql, Color.Blue);
                        }
                    }
                } 
                doc.Save("5 - Find and Markup Graphically Transformed Text in PDFs.pdf");
            }
            Console.WriteLine("PDF saved");

Try our online demo for Finding Transformed Text using a .NET PDF API to see another example.

Learn More About this .NET C# PDF API

This article scratches the surface of the full capabilities of Document Solutions for PDF. Learn how to create, extract, modify, redact, apply signatures, and more with this .NET C# PDF API. Document Solutions offers a full-fledged PDF solution, including a client-side JavaScript PDF viewer control. The JS PDF viewer control is showcased throughout this piece. To learn more about the .NET C# API and its JavaScript PDF viewer, check out our demos and documentation:

Document Solutions for PDF, .NET C# PDF API

Online Demo Explorer | Documentation

Document Solutions PDF Viewer, JavaScript PDF viewer control

Online Demo Explorer | Documentation