Chelsea Devereaux for MESCIUS inc.

Posted on Jan 10 • Originally published at developer.mescius.com

Extract Table Data from PDF Documents in C#

#webdev #devops #csharp #tutorial

In today's data-driven world, extracting structured table data from PDF documents has become a common task for developers. With GrapeCity Documents for PDF (GcPdf), you can extract that data programmatically using C#.

PDF is one of the most commonly used document formats, and the tables inside PDF files can hold a vast amount of data. Companies and organizations use PDF documents to review financial analyses, stock trends, contact details, and more. Now picture a task such as examining quarterly reports that span multiple years, where the accumulated data takes center stage.

Getting data out of these reports may initially seem easy (copy and paste). In practice, because of how PDF files are structured, a simple copy and paste rarely yields a table's worth of data without significant cleanup.

Couple this with the possibility of copying and pasting from many other documents, and it is a recipe for a very long day (or even a week or more, depending on the data required!). Handling this type of requirement efficiently requires a tool that can automate this process, and the C# .NET GcPdf API Library is the perfect tool for the job!

This article is for developers who want to decrease the time it takes to collect data and improve the accuracy of the data-gathering process. The examples will help developers gain an understanding of the GcPdf tool in order to access Table(s) in PDF files and extract tabular data for export to CSV files or other formats, such as XLSX, as needed.

Important Information About Tables in PDF Documents

Tables, much like the PDF format itself, are a nearly ubiquitous means of presenting data. Nevertheless, it's essential to understand that a PDF document inherently lacks the concept of tables; in other words, the tables you see within a PDF are purely visual elements.

These PDF 'tables' differ from what we commonly encounter in applications like MS Excel or MS Word. Instead, they are constructed through a combination of operators responsible for rendering text and graphics in specific locations, resembling a tabular structure.

This means that the traditional notions of rows, columns, and cells are foreign to a PDF file, which contains no underlying structure that identifies these elements. So, let's delve into how GcPdf's C# API Library can help us achieve this task!
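To see why recognizing a table is not trivial, here is a minimal sketch of what has to happen when all you have is positioned text. It does not use GcPdf and is not how GcPdf works internally; the TextFragment record, the RowGrouper helper, the sample coordinates, and the tolerance value are all hypothetical. It only illustrates the idea of treating fragments with similar vertical positions as one row and ordering each row from left to right (requires C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical positioned text, in the arbitrary order a PDF might draw it (units: points).
var fragments = new List<TextFragment>
{
    new("Qty", 300f, 100f),
    new("Item", 50f, 100.4f),
    new("2", 300f, 120f),
    new("Widget", 50f, 119.8f),
};

foreach (var row in RowGrouper.GroupIntoRows(fragments, tolerance: 2f))
    Console.WriteLine(string.Join(" | ", row));
// Prints:
// Item | Qty
// Widget | 2

// A piece of text and the position where it is drawn on the page (hypothetical type).
public record TextFragment(string Text, float X, float Y);

public static class RowGrouper
{
    // Walks the fragments from top to bottom. A fragment joins the current row when its Y
    // is within `tolerance` of the first fragment in that row; otherwise it starts a new row.
    // Each row is then ordered left to right by X.
    public static List<List<string>> GroupIntoRows(IEnumerable<TextFragment> fragments, float tolerance)
    {
        var rows = new List<List<TextFragment>>();
        foreach (var fragment in fragments.OrderBy(t => t.Y))
        {
            var current = rows.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= tolerance)
                current.Add(fragment);
            else
                rows.Add(new List<TextFragment> { fragment });
        }

        return rows
            .Select(r => r.OrderBy(t => t.X).Select(t => t.Text).ToList())
            .ToList();
    }
}
```

Real documents add wrapped cell text, merged cells, and ruling lines on top of this, which is why a dedicated recognition API is worth using.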

How to Extract Table Data from PDF Documents Programmatically Using C#

Create a .NET Core Console Application with GcPdf Included
Load the Sample PDF that Contains a Data Table
Define Table Recognition Parameters
Get the PDF’s Table Data
Save Extracted PDF Table Data to Another File Type (CSV)
Bonus: Format the Exported PDF Table Data in an Excel (XLSX) File

Be sure to download the sample application and try the detailed implementation of the use case scenario and code snippets described in this blog piece.

Create a .NET Core Console Application with GcPdf Included

Create a .NET Core Console application, right-click 'Dependencies,' and select 'Manage NuGet Packages'. Under the 'Browse' tab, search for 'GrapeCity.Documents.Pdf' and click 'Install'.

While installing, you will receive a 'License Acceptance' dialog. Click 'I Accept' to continue.

In the Program file, import the following namespaces:

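    using System.Collections.Generic;   // List<T>
    using System.Drawing;               // RectangleF, Color
    using System.IO;                    // File, Path
    using System.Linq;
    using System.Text;
    using GrapeCity.Documents.Pdf;
    using GrapeCity.Documents.Pdf.Recognition;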

Load the Sample PDF that Contains a Data Table

Create a GcPdfDocument instance, then call its Load method with a file stream to load the original PDF document that contains the data table. The later steps work with the document inside this using block, while the stream is still open.

    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
    {
        // Initialize GcPdf
        var pdfDoc = new GcPdfDocument();
        // Load a PDF document
        pdfDoc.Load(fs);
    }

In this example, we will use the sample invoice PDF loaded above, zugferd-invoice.pdf.

Define Table Recognition Parameters

Create a RectangleF (x, y, width, height) that defines the approximate bounds of the table on the page. The values below are given in inches and multiplied by a DPI constant of 72 to convert them.

    const float DPI = 72;
    using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "zugferd-invoice.pdf")))
    {
        // Initialize GcPdf
        var pdfDoc = new GcPdfDocument();
        // Load a PDF document
        pdfDoc.Load(fs);

        // The approx table bounds:
        var tableBounds = new RectangleF(0, 2.5f * DPI, 8.5f * DPI, 3.75f * DPI);
    }
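The arithmetic behind those bounds can be wrapped in a small helper. This is a sketch only: PdfUnits and RectFromInches are hypothetical names, not part of GcPdf, and the sketch assumes, as the DPI constant in the snippet suggests, that the bounds are measured in units of 1/72 inch.

```csharp
using System;
using System.Drawing;

// Same area as the tableBounds rectangle above:
// x = 0 in, y = 2.5 in, width = 8.5 in, height = 3.75 in.
RectangleF bounds = PdfUnits.RectFromInches(0f, 2.5f, 8.5f, 3.75f);
Console.WriteLine($"{bounds.X}, {bounds.Y}, {bounds.Width}, {bounds.Height}");
// Prints: 0, 180, 612, 270

// Hypothetical helper, not part of GcPdf.
public static class PdfUnits
{
    // Assumed resolution: 72 units per inch, matching the DPI constant in the article.
    public const float UnitsPerInch = 72f;

    // Converts a rectangle given in inches to the 72-per-inch units used for the bounds.
    public static RectangleF RectFromInches(float x, float y, float width, float height) =>
        new RectangleF(
            x * UnitsPerInch,
            y * UnitsPerInch,
            width * UnitsPerInch,
            height * UnitsPerInch);
}
```

Keeping the inch values visible at the call site makes it easier to adjust the rectangle after measuring the table in a PDF viewer.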

To help table recognition within those bounds, we use the TableExtractOptions class. It lets us fine-tune recognition to account for the idiosyncrasies of a table's formatting, such as column width, row height, and the distance between rows or columns.

    // TableExtractOptions allows fine-tuning table recognition to account for
    // the specifics of the table formatting:
    var tableExtrctOpt = new TableExtractOptions();
    var GetMinimumDistanceBetweenRows = tableExtrctOpt.GetMinimumDistanceBetweenRows;

    // In this particular case, we slightly increase the minimum distance between rows
    // to make sure cells with wrapped text are not mistaken for two cells:
    tableExtrctOpt.GetMinimumDistanceBetweenRows = (list) =>
    {
        var res = GetMinimumDistanceBetweenRows(list);
        return res * 1.2f;
    };

Get the PDF’s Table Data

Create a list to hold table data from the PDF pages.

    // CSV: list to keep table data from all pages:
    var data = new List<List<string>>();

Invoke the GetTable method with the table bounds defined in the previous step to have GcPdf search for a table inside the specified rectangle on each page.

    for (int i = 0; i < pdfDoc.Pages.Count; ++i)
    {
      // Get the table at the specified bounds:
      var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
    }

Access each cell in the table using the ITable.GetCell(rowIndex, colIndex) method. Use the Rows.Count and Cols.Count properties to loop through the extracted table cells.

    for (int i = 0; i < pdfDoc.Pages.Count; ++i)
    {
      // Get the table at the specified bounds; skip pages where none is found:
      var itable = pdfDoc.Pages[i].GetTable(tableBounds, tableExtrctOpt);
      if (itable == null)
        continue;

      // CSV: start at row 1 to ignore the header row on each page:
      for (int row = 1; row < itable.Rows.Count; ++row)
      {
        var rowData = new List<string>();
        for (int col = 0; col < itable.Cols.Count; ++col)
        {
          // A missing cell becomes an empty field; other cells are wrapped in quotes:
          var cell = itable.GetCell(row, col);
          rowData.Add(cell == null ? "" : $"\"{cell.Text}\"");
        }
        data.Add(rowData);
      }
    }
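The loop wraps each cell's text in double quotes, which protects commas inside a cell but not double quotes that are part of the text itself. The following sketch shows one way to handle that, following the usual CSV convention (as in RFC 4180) of doubling embedded quotes. CsvText, Quote, and ToLine are hypothetical helper names, and the sample row is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Made-up row with a comma, an embedded quote, and an empty cell.
var sampleRow = new List<string> { "Widget", "2,5 kg", "say \"hi\"", "" };
Console.WriteLine(CsvText.ToLine(sampleRow));
// Prints: "Widget","2,5 kg","say ""hi""",""

// Hypothetical helper, not part of GcPdf.
public static class CsvText
{
    // Wraps a value in double quotes and doubles any quotes it contains.
    // A null value is written as an empty quoted field.
    public static string Quote(string value) =>
        "\"" + (value ?? "").Replace("\"", "\"\"") + "\"";

    // Quotes every cell and joins the cells with commas into one CSV line.
    public static string ToLine(IEnumerable<string> cells) =>
        string.Join(",", cells.Select(Quote));
}
```

With a helper like this, the loop could store the raw cell text and leave all quoting to the point where lines are written.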

Save Extracted PDF Table Data to Another File Type (CSV)

For this step, we must first add the 'System.Text.Encoding.CodePages' NuGet package to the project.

Then, to save the extracted PDF Table data from the variable in the previous step, we will use the File class and invoke its AppendAllLines method.

    // The loop from the previous step has filled 'data' with one list per table row.
    // fileName is not defined in the snippets above; this output path is an assumed example:
    var fileName = "ExtractedData.csv";

    // Join each row's cells with commas and write one line per row.
    // AppendAllLines creates the file if needed and appends to it if it already exists:
    File.AppendAllLines(fileName, data.Select(cells => string.Join(",", cells)));

The data will now be available in a CSV file:

Original PDF

Extracted PDF Table Data in CSV File

Bonus: Format the Exported PDF Table Data in an Excel (XLSX) File

Although the data is now available in a format that can be easily read and manipulated, a CSV file holds only raw, unformatted values. To better utilize the data and make analysis more accessible, use the GrapeCity Documents for Excel (GcExcel) .NET edition and C# to load the CSV file into an Excel (XLSX) workbook and apply styling and formatting to the extracted data.

To use GcExcel, add the NuGet package 'GrapeCity.Documents.Excel' to the project and add its namespace.

    using GrapeCity.Documents.Excel;

Initialize a GcExcel workbook instance and load the CSV file using the Open method.

    var workbook = new GrapeCity.Documents.Excel.Workbook();
    workbook.Open(fileName, OpenFileFormat.Csv);

Get the range that holds the extracted data, then wrap its cell text, auto-size the columns, align the content, and apply a conditional color scale to the last column.

    IWorksheet worksheet = workbook.Worksheets[0];
    IRange range = worksheet.Range["A2:E10"];

    // wrapping cell content
    range.WrapText = true;

    // styling column names 
    worksheet.Range["A1"].EntireRow.Font.Bold = true;

    // auto-sizing range
    worksheet.Range["A1:E10"].AutoFit();

    // aligning cell content
    worksheet.Range["A1:E10"].HorizontalAlignment = HorizontalAlignment.Center;
    worksheet.Range["A1:E10"].VerticalAlignment = VerticalAlignment.Center;

    // applying conditional format on UnitPrice
    IColorScale twoColorScaleRule = worksheet.Range["E2:E10"].FormatConditions.AddColorScale(ColorScaleType.TwoColorScale);

    twoColorScaleRule.ColorScaleCriteria[0].Type = ConditionValueTypes.LowestValue;
    twoColorScaleRule.ColorScaleCriteria[0].FormatColor.Color = Color.FromArgb(255, 229, 229);

    twoColorScaleRule.ColorScaleCriteria[1].Type = ConditionValueTypes.HighestValue;
    twoColorScaleRule.ColorScaleCriteria[1].FormatColor.Color = Color.FromArgb(255, 20, 20);

    Thread.Sleep(1000);
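The range addresses in the snippet above ("A2:E10" and so on) are hard-coded for this particular invoice. If the number of extracted rows or columns varies, the address can be built from the counts instead. This is a plain C# sketch that does not call GcExcel; A1Address, ColumnName, and DataRange are hypothetical helper names.

```csharp
using System;

// A sheet with 10 rows in total (1 header row + 9 data rows) and 5 columns.
Console.WriteLine(A1Address.DataRange(rowCount: 10, columnCount: 5)); // Prints: A2:E10
Console.WriteLine(A1Address.ColumnName(28));                          // Prints: AB

// Hypothetical helper, not part of GcExcel.
public static class A1Address
{
    // Converts a 1-based column index to its letter name: 1 -> A, 26 -> Z, 27 -> AA.
    public static string ColumnName(int index)
    {
        if (index < 1)
            throw new ArgumentOutOfRangeException(nameof(index));

        var name = "";
        while (index > 0)
        {
            index--;                                      // shift to 0-based for this digit
            name = (char)('A' + index % 26) + name;       // prepend the letter for this digit
            index /= 26;
        }
        return name;
    }

    // Address of the data rows below the header: from A2 to the last column and last row.
    // rowCount is the total number of rows on the sheet, including the header row.
    public static string DataRange(int rowCount, int columnCount) =>
        $"A2:{ColumnName(columnCount)}{rowCount}";
}
```

The resulting string can be passed to the worksheet's Range indexer in place of the literal address.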

Lastly, save the workbook as an Excel file using the Save method:

    workbook.Save("ExtractedData_Formatted.xlsx");

As you have seen, developers can use C# and GcPdf to programmatically extract PDF table data to another file type, such as CSV. With GcExcel, that data can then be converted into a styled and formatted Excel XLSX file for easier analysis:

Original PDF

Extracted PDF Table Data in CSV File

Formatted Excel XLSX File

GrapeCity’s .NET PDF API Library

This article only scratches the surface of the full capabilities of the GrapeCity Documents for PDF (GcPdf). Review our documentation to see the many available features and our demos to see the features in action with downloadable sample projects.

Integrating this .NET PDF server-side API into a desktop or web-based application allows developers to take total control of PDFs - generating documents with speed, memory efficiency, and no dependencies.
