IronSoftware

How to Access the PDF DOM in C# (.NET Tip)

PDFs have an internal structure made up of text objects, images, paths, and annotations. Accessing this DOM (Document Object Model) lets you inspect, modify, or extract specific elements programmatically.

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("document.pdf");
var page = pdf.Pages[0];
var dom = page.ObjectModel;

foreach (var obj in dom.GetAllObjects())
{
    Console.WriteLine($"Type: {obj.GetType().Name}");
}

The ObjectModel exposes the raw PDF structure: text, images, and paths as individual objects.

What Is the PDF DOM?

Unlike HTML, a PDF has no DOM that browsers expose. Internally, though, every PDF is built from objects:

  • TextObject - Individual text runs with position, font, and content
  • ImageObject - Embedded images with dimensions and data
  • PathObject - Vector graphics, lines, shapes
  • AnnotationObject - Comments, highlights, form fields

Accessing these objects lets you:

  • Extract specific text elements (not just raw text dump)
  • Find and replace content programmatically
  • Analyze document structure
  • Modify individual elements

How Do I Access Page Objects?

Navigate from document to page to objects:

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("report.pdf");

// Access first page
var firstPage = pdf.Pages[0];
var pageObjects = firstPage.ObjectModel;

// Iterate all objects on the page
foreach (var obj in pageObjects.GetAllObjects())
{
    switch (obj)
    {
        case TextObject text:
            Console.WriteLine($"Text: {text.Text}");
            break;
        case ImageObject image:
            Console.WriteLine($"Image: {image.Width}x{image.Height}");
            break;
        case PathObject path:
            Console.WriteLine("Vector path found");
            break;
    }
}

Each page has its own ObjectModel with that page's content.

How Do I Extract Text Objects?

Get text with position information:

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("contract.pdf");
var page = pdf.Pages[0];

var textObjects = page.ObjectModel.GetTextObjects();

foreach (var text in textObjects)
{
    Console.WriteLine($"Text: '{text.Text}'");
    Console.WriteLine($"Position: ({text.X}, {text.Y})");
    Console.WriteLine($"Font: {text.FontName}, Size: {text.FontSize}");
    Console.WriteLine("---");
}

This gives you more than ExtractAllText(): you get positions, fonts, and individual text runs.

How Do I Find Specific Text?

Search for text objects matching criteria:

using IronPdf;
using System.Linq;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("invoice.pdf");

foreach (var page in pdf.Pages)
{
    var textObjects = page.ObjectModel.GetTextObjects();

    // Find all currency amounts
    var amounts = textObjects
        .Where(t => t.Text.StartsWith("$") ||
                    System.Text.RegularExpressions.Regex.IsMatch(t.Text, @"\d+\.\d{2}"))
        .ToList();

    foreach (var amount in amounts)
    {
        Console.WriteLine($"Found amount: {amount.Text} at ({amount.X}, {amount.Y})");
    }
}

Combine text position with content to locate specific data.
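The pattern in the example above is unanchored, so it also matches strings such as "12.345" or "v1.25beta". A stricter check could be sketched as follows, assuming amounts use a dot as the decimal separator and optional comma thousands separators. AmountParser and TryParseAmount are hypothetical helper names, not part of IronPDF; the method takes the plain string you would read from a text object.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (not an IronPDF type): validates and parses a currency string.
public static class AmountParser
{
    // Optional "$", then digits (plain or comma-grouped), then exactly two decimals.
    // The ^ and $ anchors reject strings that merely contain an amount.
    private static readonly Regex AmountPattern =
        new Regex(@"^\$?(\d{1,3}(,\d{3})+|\d+)\.\d{2}$", RegexOptions.Compiled);

    public static bool TryParseAmount(string text, out decimal amount)
    {
        amount = 0m;
        if (string.IsNullOrWhiteSpace(text)) return false;

        var trimmed = text.Trim();
        if (!AmountPattern.IsMatch(trimmed)) return false;

        // Invariant culture so "." is always the decimal separator.
        return decimal.TryParse(
            trimmed.TrimStart('$'),
            NumberStyles.AllowDecimalPoint | NumberStyles.AllowThousands,
            CultureInfo.InvariantCulture,
            out amount);
    }
}
```

Inside the Where(...) filter, a call such as AmountParser.TryParseAmount(t.Text, out _) could replace the two original conditions, and the parsed decimal is available if you need to sum the amounts.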

How Do I Extract Images from the DOM?

Access embedded images:

using IronPdf;
using System.IO;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("brochure.pdf");

int imageCount = 0;
foreach (var page in pdf.Pages)
{
    var images = page.ObjectModel.GetImageObjects();

    foreach (var image in images)
    {
        Console.WriteLine($"Image {++imageCount}:");
        Console.WriteLine($"  Size: {image.Width}x{image.Height}");
        Console.WriteLine($"  Position: ({image.X}, {image.Y})");

        // Export the image (the .png extension assumes the returned bytes are PNG-encoded)
        var bytes = image.GetImageData();
        File.WriteAllBytes($"image-{imageCount}.png", bytes);
    }
}

Extract images while preserving position information.
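The export step writes every image with a .png extension, which is only correct if the bytes really are PNG data. A minimal sketch of checking the leading signature bytes before choosing an extension follows. ImageFormatSniffer and GuessExtension are hypothetical names, not part of IronPDF, and what encoding GetImageData() returns is not confirmed here; the signatures themselves are the standard ones for each format.

```csharp
// Hypothetical helper (not an IronPDF type): guesses a file extension
// from the first bytes of an encoded image.
public static class ImageFormatSniffer
{
    public static string GuessExtension(byte[] data)
    {
        // PNG: 89 'P' 'N' 'G' CR LF 1A LF
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return ".png";
        // JPEG: FF D8 FF
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";
        // GIF: "GIF8" (covers GIF87a and GIF89a)
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif";
        // BMP: "BM"
        if (StartsWith(data, 0x42, 0x4D)) return ".bmp";
        // Unknown or raw pixel data: keep the bytes but do not claim a format.
        return ".bin";
    }

    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data == null || data.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

With this in place, the file name could be built as $"image-{imageCount}{ImageFormatSniffer.GuessExtension(bytes)}" instead of hard-coding .png.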

How Do I Analyze Document Structure?

Understand what a PDF contains:

using IronPdf;
using System.Linq;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("document.pdf");

var analysis = new
{
    PageCount = pdf.PageCount,
    Pages = pdf.Pages.Select((page, index) => new
    {
        PageNumber = index + 1,
        TextObjects = page.ObjectModel.GetTextObjects().Count(),
        ImageObjects = page.ObjectModel.GetImageObjects().Count(),
        PathObjects = page.ObjectModel.GetPathObjects().Count()
    }).ToList()
};

Console.WriteLine($"Document has {analysis.PageCount} pages");
foreach (var page in analysis.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}:");
    Console.WriteLine($"  Text objects: {page.TextObjects}");
    Console.WriteLine($"  Images: {page.ImageObjects}");
    Console.WriteLine($"  Paths: {page.PathObjects}");
}

This helps understand document complexity before processing.

How Do I Find Text by Position?

Locate text in specific regions:

using IronPdf;
using System.Linq;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("form.pdf");
var page = pdf.Pages[0];

// Define region of interest (e.g., top-right header area)
var regionX = 400;
var regionY = 700;
var regionWidth = 200;
var regionHeight = 100;

var textInRegion = page.ObjectModel.GetTextObjects()
    .Where(t =>
        t.X >= regionX &&
        t.X <= regionX + regionWidth &&
        t.Y >= regionY &&
        t.Y <= regionY + regionHeight)
    .ToList();

Console.WriteLine($"Text in region: {string.Join(" ", textInRegion.Select(t => t.Text))}");

Useful for extracting data from forms with known layouts.
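Layouts are often measured from the top-left corner (for example in a PDF viewer or design tool), while PDF coordinates start at the bottom-left. A small, library-independent sketch of converting such a rectangle and filtering by it follows. TextRun, PageRegion, and RegionFilter are hypothetical types introduced for illustration; in practice you would project the X, Y, and Text values of your text objects into something similar, and the page height (792 points for US Letter) is an assumption you supply.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text object's content and anchor position.
public sealed record TextRun(string Text, double X, double Y);

// A rectangle in PDF space: origin at the bottom-left, Y increasing upward.
public sealed record PageRegion(double Left, double Bottom, double Width, double Height)
{
    public bool Contains(double x, double y) =>
        x >= Left && x <= Left + Width &&
        y >= Bottom && y <= Bottom + Height;

    // Converts a rectangle measured from the top-left corner of the page
    // into bottom-left-origin PDF coordinates.
    public static PageRegion FromTopLeft(
        double left, double top, double width, double height, double pageHeight) =>
        new PageRegion(left, pageHeight - top - height, width, height);
}

public static class RegionFilter
{
    // Keeps only the runs whose anchor point lies inside the region.
    public static List<TextRun> InRegion(IEnumerable<TextRun> runs, PageRegion region) =>
        runs.Where(r => region.Contains(r.X, r.Y)).ToList();
}
```

For a 792-point-high page, PageRegion.FromTopLeft(400, 0, 200, 92, 792) describes a header box whose bottom edge sits at Y = 700 in PDF space. Note that this tests only each run's anchor point, so a run that starts outside the region but extends into it is not matched.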

How Do I Work with Vector Paths?

Access drawing instructions:

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("diagram.pdf");
var page = pdf.Pages[0];

var paths = page.ObjectModel.GetPathObjects();

foreach (var path in paths)
{
    Console.WriteLine($"Path with {path.Operations.Count} operations");
    Console.WriteLine($"  Fill color: {path.FillColor}");
    Console.WriteLine($"  Stroke color: {path.StrokeColor}");
    Console.WriteLine($"  Stroke width: {path.StrokeWidth}");
}

Paths represent lines, rectangles, curves, and complex shapes.

How Do I Modify Object Properties?

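Modification support is experimental. The example below locates a placeholder and reports its position; it does not change the object:

using IronPdf;
using System.Linq;
// Install via NuGet: Install-Package IronPdf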

var pdf = PdfDocument.FromFile("template.pdf");
var page = pdf.Pages[0];

// Locate the text object that holds a placeholder
var textObjects = page.ObjectModel.GetTextObjects();
var placeholder = textObjects.FirstOrDefault(t => t.Text == "{{NAME}}");

if (placeholder != null)
{
    // Note: Modification capabilities vary by PDF structure
    // Some changes may require regenerating the page
    Console.WriteLine($"Found placeholder at ({placeholder.X}, {placeholder.Y})");
}

// Nothing above alters the document, so this saves an unchanged copy
pdf.SaveAs("modified.pdf");

Direct DOM modification is complex due to PDF's internal structure. For text replacement, use ReplaceText() methods instead.

What Are the Limitations?

PDF DOM access has constraints:

  1. Structure varies by generator - PDFs from different tools have different internal structures
  2. Text may be fragmented - A "sentence" might be multiple TextObjects
  3. Coordinates are in PDF units - By default one unit is 1/72 inch, and the origin is bottom-left, not top-left
  4. Modification is limited - Changing objects may break PDF structure
  5. Feature is experimental - API may change in future versions

For most use cases, higher-level methods (ExtractAllText(), ReplaceText()) are more reliable.
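Fragmentation (limitation 2) can often be worked around by regrouping runs that share a baseline. The following is a rough sketch under stated assumptions: TextFragment and LineBuilder are hypothetical types, not IronPDF APIs; runs on the same line are assumed to have Y values within a small tolerance of each other; and fragments are joined with a single space, which will be wrong when a word is split mid-word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text object's content and anchor position.
public sealed record TextFragment(string Text, double X, double Y);

public static class LineBuilder
{
    // Groups fragments into lines (top to bottom) and orders each line left to right.
    public static List<string> BuildLines(
        IEnumerable<TextFragment> fragments, double yTolerance = 2.0)
    {
        var lines = new List<List<TextFragment>>();

        // PDF Y grows upward, so descending Y walks the page from top to bottom.
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;

            // Same line if the baseline is close to that of the line's first fragment.
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= yTolerance)
            {
                current.Add(fragment);
            }
            else
            {
                lines.Add(new List<TextFragment> { fragment });
            }
        }

        return lines
            .Select(line => string.Join(" ", line.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

The tolerance is a judgment call: too small and superscripts or slightly offset runs end up on their own lines, too large and adjacent lines of small text merge. Multi-column layouts would need an additional split on X before grouping.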

When Should I Use DOM Access?

Good use cases:

  • Analyzing document structure
  • Extracting text with position data
  • Finding images with their locations
  • Building document inspection tools
  • Understanding PDF content for debugging

Better alternatives:

  • Simple text extraction → ExtractAllText()
  • Find and replace → ReplaceText()
  • Image extraction → ExtractAllImages()
  • Adding content → Use stamps, watermarks, or HTML generation

Complete Example: Document Analyzer

using IronPdf;
using System.Collections.Generic;
using System.Linq;
// Install via NuGet: Install-Package IronPdf

public class PdfAnalyzer
{
    public DocumentAnalysis Analyze(string pdfPath)
    {
        var pdf = PdfDocument.FromFile(pdfPath);
        var analysis = new DocumentAnalysis
        {
            PageCount = pdf.PageCount,
            Pages = new List<PageAnalysis>()
        };

        foreach (var page in pdf.Pages)
        {
            var textObjects = page.ObjectModel.GetTextObjects().ToList();
            var imageObjects = page.ObjectModel.GetImageObjects().ToList();

            analysis.Pages.Add(new PageAnalysis
            {
                TextCount = textObjects.Count,
                ImageCount = imageObjects.Count,
                UniqueFont = textObjects.Select(t => t.FontName).Distinct().Count(),
                TotalTextLength = textObjects.Sum(t => t.Text?.Length ?? 0)
            });
        }

        return analysis;
    }
}

public class DocumentAnalysis
{
    public int PageCount { get; set; }
    public List<PageAnalysis> Pages { get; set; }
}

public class PageAnalysis
{
    public int TextCount { get; set; }
    public int ImageCount { get; set; }
    public int UniqueFont { get; set; }
    public int TotalTextLength { get; set; }
}

PDF DOM access opens the document's internal structure. Use it for inspection and analysis, but prefer higher-level APIs for modification.

For more details on DOM access, see the IronPDF DOM documentation.


Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.
