How to Access the PDF DOM in C# (.NET Tip)

PDFs have an internal structure—text objects, images, paths, annotations. Accessing this DOM (Document Object Model) lets you inspect, modify, or extract specific elements programmatically.

using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("document.pdf");
var page = pdf.Pages[0];
var dom = page.ObjectModel;

foreach (var obj in dom.GetAllObjects())
{
    Console.WriteLine($"Type: {obj.GetType().Name}");
}

The ObjectModel exposes the raw PDF structure—text, images, and paths as individual objects.

What Is the PDF DOM?

Unlike HTML, a PDF has no DOM that browsers expose for inspection. Internally, though, every PDF page is built from objects:

TextObject - Individual text runs with position, font, and content
ImageObject - Embedded images with dimensions and data
PathObject - Vector graphics, lines, shapes
AnnotationObject - Comments, highlights, form fields

Accessing these objects lets you:

Extract specific text elements (not just raw text dump)
Find and replace content programmatically
Analyze document structure
Modify individual elements

How Do I Access Page Objects?

Navigate from document to page to objects:

using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("report.pdf");

// Access first page
var firstPage = pdf.Pages[0];
var pageObjects = firstPage.ObjectModel;

// Iterate all objects on the page
foreach (var obj in pageObjects.GetAllObjects())
{
    switch (obj)
    {
        case TextObject text:
            Console.WriteLine($"Text: {text.Text}");
            break;
        case ImageObject image:
            Console.WriteLine($"Image: {image.Width}x{image.Height}");
            break;
        case PathObject path:
            Console.WriteLine("Vector path found");
            break;
    }
}

Each page has its own ObjectModel with that page's content.

How Do I Extract Text Objects?

Get text with position information:

using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("contract.pdf");
var page = pdf.Pages[0];

var textObjects = page.ObjectModel.GetTextObjects();

foreach (var text in textObjects)
{
    Console.WriteLine($"Text: '{text.Text}'");
    Console.WriteLine($"Position: ({text.X}, {text.Y})");
    Console.WriteLine($"Font: {text.FontName}, Size: {text.FontSize}");
    Console.WriteLine("---");
}

This gives you more than ExtractAllText()—you get positioning, fonts, and individual text runs.
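Font metadata can hint at structure that a plain text dump loses. The sketch below shows one possible heuristic and is not part of IronPDF: it treats the font size covering the most characters as body text and flags noticeably larger runs as heading candidates. `StyledRun`, `HeadingFinder`, and the 1.2 ratio are hypothetical names and values chosen for illustration; map them onto whatever your text objects actually expose.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for text objects read from a page.
var runs = new List<StyledRun>
{
    new("Master Services Agreement", "Helvetica-Bold", 18),
    new("This agreement is made between", "Helvetica", 10),
    new("the parties listed below.", "Helvetica", 10),
    new("1. Definitions", "Helvetica-Bold", 13),
};

foreach (var heading in HeadingFinder.FindHeadings(runs, ratio: 1.2))
    Console.WriteLine($"Heading candidate: {heading.Text} ({heading.FontSize}pt)");

// Hypothetical stand-in for a text object: content plus font metadata.
public record StyledRun(string Text, string FontName, double FontSize);

public static class HeadingFinder
{
    // Returns runs whose font size exceeds the body size by the given ratio.
    public static List<StyledRun> FindHeadings(IReadOnlyCollection<StyledRun> runs, double ratio)
    {
        if (runs.Count == 0) return new List<StyledRun>();

        // Body size = the font size that covers the most characters.
        double bodySize = runs
            .GroupBy(r => r.FontSize)
            .OrderByDescending(g => g.Sum(r => r.Text.Length))
            .First().Key;

        return runs.Where(r => r.FontSize > bodySize * ratio).ToList();
    }
}
```

A ratio-based threshold adapts to documents with different base font sizes, but it will miss headings that are only bold and not larger.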

How Do I Find Specific Text?

Search for text objects matching criteria:

using System;
using System.Linq;
using System.Text.RegularExpressions;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("invoice.pdf");

foreach (var page in pdf.Pages)
{
    var textObjects = page.ObjectModel.GetTextObjects();

    // Find currency amounts: text starting with "$" or containing a number with two decimals
    var amounts = textObjects
        .Where(t => t.Text.StartsWith("$", StringComparison.Ordinal) ||
                    Regex.IsMatch(t.Text, @"\d+\.\d{2}"))
        .ToList();

    foreach (var amount in amounts)
    {
        Console.WriteLine($"Found amount: {amount.Text} at ({amount.X}, {amount.Y})");
    }
}

Combine text position with content to locate specific data.
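Once a text run has been matched, the amount usually needs to become a number. The following is a minimal sketch, independent of IronPDF, that works on plain strings. `AmountParser` and its pattern are hypothetical: the pattern assumes US-style formatting (optional `$`, optional comma thousands separators, exactly two decimals) and is deliberately stricter than a bare `\d+\.\d{2}`, so version-like strings such as `2024.1` or `3.14159` are ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample strings standing in for the contents of text objects.
var samples = new[] { "Total: $1,234.50", "Qty 3", "$19.99", "Invoice 2024.1" };

foreach (var sample in samples)
{
    var amounts = AmountParser.Parse(sample).ToList();
    Console.WriteLine($"{sample} -> {amounts.Count} amount(s)");
}

public static class AmountParser
{
    // Optional "$", digits with optional thousands separators, exactly two decimals,
    // and no digit or dot directly before or after the match.
    private static readonly Regex AmountPattern = new Regex(
        @"(?<![\d.])\$?\s?(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})(?![\d.])",
        RegexOptions.Compiled);

    public static IEnumerable<decimal> Parse(string text)
    {
        foreach (Match match in AmountPattern.Matches(text))
        {
            // Strip thousands separators before parsing with the invariant culture.
            var digits = match.Groups[1].Value.Replace(",", "") + "." + match.Groups[2].Value;
            if (decimal.TryParse(digits, NumberStyles.AllowDecimalPoint,
                                 CultureInfo.InvariantCulture, out var value))
            {
                yield return value;
            }
        }
    }
}
```

Parsing to `decimal` instead of `double` avoids binary rounding surprises when the amounts are later summed.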

How Do I Extract Images from the DOM?

Access embedded images:

using System;
using System.IO;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("brochure.pdf");

int imageCount = 0;
foreach (var page in pdf.Pages)
{
    var images = page.ObjectModel.GetImageObjects();

    foreach (var image in images)
    {
        Console.WriteLine($"Image {++imageCount}:");
        Console.WriteLine($"  Size: {image.Width}x{image.Height}");
        Console.WriteLine($"  Position: ({image.X}, {image.Y})");

        // Export the image (the .png extension assumes the data is PNG-encoded)
        var bytes = image.GetImageData();
        File.WriteAllBytes($"image-{imageCount}.png", bytes);
    }
}

Extract images while preserving position information.
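Embedded images are not always PNG, so a hard-coded `.png` extension can mislabel the exported files. As a hedged sketch, the helper below guesses an extension from the well-known leading "magic" bytes of a few formats. `ImageSniffer` is a hypothetical helper, not an IronPDF API, and it assumes the bytes you pass in are an encoded image file and not raw pixel data.

```csharp
using System;

// Hypothetical byte arrays standing in for data returned by an image object.
byte[] png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00 };
byte[] jpeg = { 0xFF, 0xD8, 0xFF, 0xE0 };
byte[] unknown = { 0x00, 0x01 };

Console.WriteLine(ImageSniffer.GuessExtension(png));     // .png
Console.WriteLine(ImageSniffer.GuessExtension(jpeg));    // .jpg
Console.WriteLine(ImageSniffer.GuessExtension(unknown)); // .bin

public static class ImageSniffer
{
    // Maps well-known file signatures to extensions; falls back to ".bin".
    public static string GuessExtension(byte[] data)
    {
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return ".png";
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif"; // "GIF8"
        if (StartsWith(data, 0x42, 0x4D)) return ".bmp";             // "BM"
        return ".bin";
    }

    private static bool StartsWith(byte[] data, params byte[] magic)
    {
        if (data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

With a helper like this, the file name in the export loop could be built as `$"image-{imageCount}{ImageSniffer.GuessExtension(bytes)}"`.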

How Do I Analyze Document Structure?

Understand what a PDF contains:

using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("document.pdf");

var analysis = new
{
    PageCount = pdf.PageCount,
    Pages = pdf.Pages.Select((page, index) => new
    {
        PageNumber = index + 1,
        TextObjects = page.ObjectModel.GetTextObjects().Count(),
        ImageObjects = page.ObjectModel.GetImageObjects().Count(),
        PathObjects = page.ObjectModel.GetPathObjects().Count()
    }).ToList()
};

Console.WriteLine($"Document has {analysis.PageCount} pages");
foreach (var page in analysis.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}:");
    Console.WriteLine($"  Text objects: {page.TextObjects}");
    Console.WriteLine($"  Images: {page.ImageObjects}");
    Console.WriteLine($"  Paths: {page.PathObjects}");
}

This helps understand document complexity before processing.
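Object counts like these can drive a processing decision. The sketch below is an illustrative heuristic only: a page with images but no text objects is probably a scan that needs OCR, and a page dominated by paths is probably a drawing. `PageStats`, `PageKind`, `PageClassifier`, and the factor of 10 are all hypothetical and would need tuning against real documents.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-page counts, shaped like the per-page objects built above.
var pages = new List<PageStats>
{
    new(1, TextObjects: 120, ImageObjects: 2, PathObjects: 15),
    new(2, TextObjects: 0, ImageObjects: 1, PathObjects: 0),
    new(3, TextObjects: 4, ImageObjects: 0, PathObjects: 900),
};

foreach (var page in pages)
    Console.WriteLine($"Page {page.PageNumber}: {PageClassifier.Classify(page)}");

public record PageStats(int PageNumber, int TextObjects, int ImageObjects, int PathObjects);

public enum PageKind { Text, LikelyScanned, VectorHeavy, Empty }

public static class PageClassifier
{
    public static PageKind Classify(PageStats p)
    {
        if (p.TextObjects == 0 && p.ImageObjects == 0 && p.PathObjects == 0)
            return PageKind.Empty;

        // No text at all: an image-only page is likely a scan, otherwise a drawing.
        if (p.TextObjects == 0)
            return p.ImageObjects > 0 ? PageKind.LikelyScanned : PageKind.VectorHeavy;

        // Arbitrary threshold: paths outnumber text objects by more than 10 to 1.
        if (p.PathObjects > 10 * p.TextObjects)
            return PageKind.VectorHeavy;

        return PageKind.Text;
    }
}
```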

How Do I Find Text by Position?

Locate text in specific regions:

using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("form.pdf");
var page = pdf.Pages[0];

// Define region of interest (e.g., top-right header area)
var regionX = 400;
var regionY = 700;
var regionWidth = 200;
var regionHeight = 100;

var textInRegion = page.ObjectModel.GetTextObjects()
    .Where(t =>
        t.X >= regionX &&
        t.X <= regionX + regionWidth &&
        t.Y >= regionY &&
        t.Y <= regionY + regionHeight)
    .ToList();

Console.WriteLine($"Text in region: {string.Join(" ", textInRegion.Select(t => t.Text))}");

Useful for extracting data from forms with known layouts.
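Regions are often measured in a viewer that counts from the top-left corner, while PDF coordinates count from the bottom-left. The sketch below shows one way to convert such a region and test containment. `Region` and `TextRun` are hypothetical types, and the page height of 792 points assumes a US Letter page; read the real page size from your document.

```csharp
using System;
using System.Linq;

const double pageHeight = 792; // US Letter height in points (11 in x 72), an assumption

// Header band measured from the top-left corner: 200 wide, 100 tall, flush with the top edge.
var header = Region.FromTopLeft(left: 400, top: 0, width: 200, height: 100, pageHeight: pageHeight);

// Hypothetical runs with bottom-left-origin coordinates.
var runs = new[]
{
    new TextRun("Invoice #1042", 420, 750),
    new TextRun("Page body", 72, 400),
};

var inHeader = runs.Where(r => header.Contains(r.X, r.Y)).Select(r => r.Text);
Console.WriteLine(string.Join(" ", inHeader)); // Invoice #1042

public record TextRun(string Text, double X, double Y);

// Rectangle in PDF space: (X, Y) is the lower-left corner.
public record Region(double X, double Y, double Width, double Height)
{
    // Flips the vertical axis so a top-left measurement becomes a bottom-left rectangle.
    public static Region FromTopLeft(double left, double top, double width, double height, double pageHeight)
        => new Region(left, pageHeight - top - height, width, height);

    public bool Contains(double x, double y)
        => x >= X && x <= X + Width && y >= Y && y <= Y + Height;
}
```

Keeping the conversion in one place avoids scattering `pageHeight - y` arithmetic through the filtering code.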

How Do I Work with Vector Paths?

Access drawing instructions:

using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("diagram.pdf");
var page = pdf.Pages[0];

var paths = page.ObjectModel.GetPathObjects();

foreach (var path in paths)
{
    Console.WriteLine($"Path with {path.Operations.Count} operations");
    Console.WriteLine($"  Fill color: {path.FillColor}");
    Console.WriteLine($"  Stroke color: {path.StrokeColor}");
    Console.WriteLine($"  Stroke width: {path.StrokeWidth}");
}

Paths represent lines, rectangles, curves, and complex shapes.

How Do I Modify Object Properties?

Locate the object you want to change (modification support is experimental):

using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("template.pdf");
var page = pdf.Pages[0];

// Find and modify specific text
var textObjects = page.ObjectModel.GetTextObjects();
var placeholder = textObjects.FirstOrDefault(t => t.Text == "{{NAME}}");

if (placeholder != null)
{
    // Note: Modification capabilities vary by PDF structure
    // Some changes may require regenerating the page
    Console.WriteLine($"Found placeholder at ({placeholder.X}, {placeholder.Y})");
}

// No object was changed above, so this writes an unmodified copy.
pdf.SaveAs("modified.pdf");

Direct DOM modification is complex due to PDF's internal structure. For text replacement, use ReplaceText() methods instead.
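Before replacing anything, it can help to list which placeholders a template actually contains. An exact comparison such as `Text == "{{NAME}}"` misses tokens embedded in longer runs, so this sketch scans strings with a regular expression instead. `PlaceholderScanner` is a hypothetical helper working on plain strings, and it assumes each `{{TOKEN}}` sits inside a single run and is not split across several.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical strings standing in for the contents of text objects.
var contents = new[]
{
    "Dear {{NAME}},",
    "Your order {{ORDER_ID}} ships on {{DATE}}.",
    "Thank you, {{NAME}}.",
};

var names = PlaceholderScanner.FindNames(contents);
Console.WriteLine(string.Join(", ", names)); // NAME, ORDER_ID, DATE

public static class PlaceholderScanner
{
    // Matches {{TOKEN}} where TOKEN is letters, digits, or underscores.
    private static readonly Regex Token = new Regex(@"\{\{(\w+)\}\}");

    // Returns each distinct placeholder name in order of first appearance.
    public static List<string> FindNames(IEnumerable<string> contents)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        foreach (var text in contents)
        {
            foreach (Match match in Token.Matches(text))
            {
                var name = match.Groups[1].Value;
                if (seen.Add(name)) result.Add(name);
            }
        }
        return result;
    }
}
```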

What Are the Limitations?

PDF DOM access has constraints:

Structure varies by generator - PDFs from different tools have different internal structures
Text may be fragmented - A "sentence" might be multiple TextObjects
Coordinates are in PDF points (1/72 inch) - Origin is bottom-left by default, not top-left
Modification is limited - Changing objects may break PDF structure
Feature is experimental - API may change in future versions

For most use cases, higher-level methods (ExtractAllText(), ReplaceText()) are more reliable.
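The fragmentation problem can often be softened by regrouping runs into lines. A minimal sketch, assuming each run carries a baseline position: sort runs from top to bottom, put runs whose Y values fall within a tolerance on the same line, then order each line left to right. `TextRun`, `LineBuilder`, and the 2-point tolerance are hypothetical; rotated text and multi-column layouts need more care than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical runs with bottom-left-origin coordinates, deliberately out of order.
var runs = new List<TextRun>
{
    new("World", 140, 700.2),
    new("Hello", 100, 700.0),
    new("Second line", 100, 686.0),
};

foreach (var line in LineBuilder.BuildLines(runs, yTolerance: 2.0))
    Console.WriteLine(line);
// Hello World
// Second line

public record TextRun(string Text, double X, double Y);

public static class LineBuilder
{
    public static List<string> BuildLines(IEnumerable<TextRun> runs, double yTolerance)
    {
        var lines = new List<List<TextRun>>();

        // Larger Y is higher on the page when the origin is bottom-left.
        foreach (var run in runs.OrderByDescending(r => r.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - run.Y) <= yTolerance)
                current.Add(run);                      // same baseline, same line
            else
                lines.Add(new List<TextRun> { run });  // start a new line
        }

        // Within each line, read left to right.
        return lines
            .Select(line => string.Join(" ", line.OrderBy(r => r.X).Select(r => r.Text)))
            .ToList();
    }
}
```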

When Should I Use DOM Access?

Good use cases:

Analyzing document structure
Extracting text with position data
Finding images with their locations
Building document inspection tools
Understanding PDF content for debugging

Better alternatives:

Simple text extraction → ExtractAllText()
Find and replace → ReplaceText()
Image extraction → ExtractAllImages()
Adding content → Use stamps, watermarks, or HTML generation

Complete Example: Document Analyzer

using System.Collections.Generic;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf

public class PdfAnalyzer
{
    public DocumentAnalysis Analyze(string pdfPath)
    {
        // Dispose the document when analysis is done to release its resources
        using var pdf = PdfDocument.FromFile(pdfPath);
        var analysis = new DocumentAnalysis
        {
            PageCount = pdf.PageCount,
            Pages = new List<PageAnalysis>()
        };

        foreach (var page in pdf.Pages)
        {
            var textObjects = page.ObjectModel.GetTextObjects().ToList();
            var imageObjects = page.ObjectModel.GetImageObjects().ToList();

            analysis.Pages.Add(new PageAnalysis
            {
                TextCount = textObjects.Count,
                ImageCount = imageObjects.Count,
                UniqueFont = textObjects.Select(t => t.FontName).Distinct().Count(),
                TotalTextLength = textObjects.Sum(t => t.Text?.Length ?? 0)
            });
        }

        return analysis;
    }
}

public class DocumentAnalysis
{
    public int PageCount { get; set; }
    public List<PageAnalysis> Pages { get; set; }
}

public class PageAnalysis
{
    public int TextCount { get; set; }
    public int ImageCount { get; set; }
    public int UniqueFont { get; set; }
    public int TotalTextLength { get; set; }
}
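An analysis result like this is easy to hand to other tools as JSON. The sketch below uses `System.Text.Json` on a hand-built `DocumentAnalysis`, so it runs without a PDF file. `AnalysisReport` is a hypothetical helper, and the two data classes are repeated from the example above only to keep the snippet self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hand-built result standing in for the output of PdfAnalyzer.Analyze(...).
var analysis = new DocumentAnalysis
{
    PageCount = 2,
    Pages = new List<PageAnalysis>
    {
        new PageAnalysis { TextCount = 120, ImageCount = 2, UniqueFont = 3, TotalTextLength = 4200 },
        new PageAnalysis { TextCount = 0, ImageCount = 1, UniqueFont = 0, TotalTextLength = 0 },
    }
};

string json = AnalysisReport.ToJson(analysis);
Console.WriteLine(json);

public static class AnalysisReport
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions { WriteIndented = true };

    public static string ToJson(DocumentAnalysis analysis)
        => JsonSerializer.Serialize(analysis, Options);

    public static DocumentAnalysis FromJson(string json)
        => JsonSerializer.Deserialize<DocumentAnalysis>(json, Options)
           ?? throw new JsonException("JSON did not contain a document analysis.");
}

public class DocumentAnalysis
{
    public int PageCount { get; set; }
    public List<PageAnalysis> Pages { get; set; } = new List<PageAnalysis>();
}

public class PageAnalysis
{
    public int TextCount { get; set; }
    public int ImageCount { get; set; }
    public int UniqueFont { get; set; }
    public int TotalTextLength { get; set; }
}
```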

PDF DOM access opens the document's internal structure. Use it for inspection and analysis, but prefer higher-level APIs for modification.

For more details on DOM access, see the IronPDF DOM documentation.

Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.