PDFs have an internal structure—text objects, images, paths, annotations. Accessing this DOM (Document Object Model) lets you inspect, modify, or extract specific elements programmatically.
using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("document.pdf");
var page = pdf.Pages[0];
var dom = page.ObjectModel;
// Print the runtime type of every object on the first page
foreach (var obj in dom.GetAllObjects())
{
    Console.WriteLine($"Type: {obj.GetType().Name}");
}
The ObjectModel exposes the raw PDF structure—text, images, and paths as individual objects.
What Is the PDF DOM?
Unlike HTML, a PDF has no DOM that you can inspect in a browser. But internally, every PDF is built from objects:
- TextObject - Individual text runs with position, font, and content
- ImageObject - Embedded images with dimensions and data
- PathObject - Vector graphics, lines, shapes
- AnnotationObject - Comments, highlights, form fields
Accessing these objects lets you:
- Extract specific text elements (not just a raw text dump)
- Find and replace content programmatically
- Analyze document structure
- Modify individual elements
How Do I Access Page Objects?
Navigate from document to page to objects:
using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("report.pdf");
// Access first page
var firstPage = pdf.Pages[0];
var pageObjects = firstPage.ObjectModel;
// Iterate all objects on the page and branch on their type
foreach (var obj in pageObjects.GetAllObjects())
{
    switch (obj)
    {
        case TextObject text:
            Console.WriteLine($"Text: {text.Text}");
            break;
        case ImageObject image:
            Console.WriteLine($"Image: {image.Width}x{image.Height}");
            break;
        case PathObject path:
            Console.WriteLine("Vector path found");
            break;
    }
}
Each page has its own ObjectModel with that page's content.
How Do I Extract Text Objects?
Get text with position information:
using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("contract.pdf");
var page = pdf.Pages[0];
var textObjects = page.ObjectModel.GetTextObjects();
// Print content, position, and font for each text run
foreach (var text in textObjects)
{
    Console.WriteLine($"Text: '{text.Text}'");
    Console.WriteLine($"Position: ({text.X}, {text.Y})");
    Console.WriteLine($"Font: {text.FontName}, Size: {text.FontSize}");
    Console.WriteLine("---");
}
This gives you more than ExtractAllText()—you get positioning, fonts, and individual text runs.
How Do I Find Specific Text?
Search for text objects matching criteria:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("invoice.pdf");
foreach (var page in pdf.Pages)
{
    var textObjects = page.ObjectModel.GetTextObjects();
    // Find all currency amounts: text starting with "$" or containing digits with two decimals
    var amounts = textObjects
        .Where(t => t.Text.StartsWith("$") ||
                    Regex.IsMatch(t.Text, @"\d+\.\d{2}"))
        .ToList();
    foreach (var amount in amounts)
    {
        Console.WriteLine($"Found amount: {amount.Text} at ({amount.X}, {amount.Y})");
    }
}
Combine text position with content to locate specific data.
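The pattern in the snippet above is unanchored, so it also matches the 12.34 inside 12.345, and any text that merely starts with "$" passes the first test. A stricter filter can be sketched on plain strings, independent of IronPDF. The AmountFilter class and its pattern are illustrative assumptions, not part of the library, and the pattern only covers US-style amounts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: filters plain strings, not IronPDF objects.
public static class AmountFilter
{
    // Anchored pattern: optional "$", then either comma-grouped digits
    // (1,234) or plain digits (1234), then exactly two decimal places.
    private static readonly Regex Amount =
        new Regex(@"^\$?(?:\d{1,3}(?:,\d{3})+|\d+)\.\d{2}$", RegexOptions.Compiled);

    // Returns the trimmed strings that are complete currency amounts.
    public static List<string> FindAmounts(IEnumerable<string> runs) =>
        runs.Select(r => r.Trim())
            .Where(r => Amount.IsMatch(r))
            .ToList();
}
```

To apply it to a page, pass in the text of each text object, for example textObjects.Select(t => t.Text).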
How Do I Extract Images from the DOM?
Access embedded images:
using System;
using System.IO;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("brochure.pdf");
int imageCount = 0;
foreach (var page in pdf.Pages)
{
    var images = page.ObjectModel.GetImageObjects();
    foreach (var image in images)
    {
        Console.WriteLine($"Image {++imageCount}:");
        Console.WriteLine($" Size: {image.Width}x{image.Height}");
        Console.WriteLine($" Position: ({image.X}, {image.Y})");
        // Export the image
        var bytes = image.GetImageData();
        File.WriteAllBytes($"image-{imageCount}.png", bytes);
    }
}
Extract images while preserving position information.
How Do I Analyze Document Structure?
Understand what a PDF contains:
using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("document.pdf");
// Count each object type per page
var analysis = new
{
    PageCount = pdf.PageCount,
    Pages = pdf.Pages.Select((page, index) => new
    {
        PageNumber = index + 1,
        TextObjects = page.ObjectModel.GetTextObjects().Count(),
        ImageObjects = page.ObjectModel.GetImageObjects().Count(),
        PathObjects = page.ObjectModel.GetPathObjects().Count()
    }).ToList()
};
Console.WriteLine($"Document has {analysis.PageCount} pages");
foreach (var page in analysis.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}:");
    Console.WriteLine($" Text objects: {page.TextObjects}");
    Console.WriteLine($" Images: {page.ImageObjects}");
    Console.WriteLine($" Paths: {page.PathObjects}");
}
This helps understand document complexity before processing.
How Do I Find Text by Position?
Locate text in specific regions:
using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("form.pdf");
var page = pdf.Pages[0];
// Define region of interest (e.g., top-right header area)
var regionX = 400;
var regionY = 700;
var regionWidth = 200;
var regionHeight = 100;
// Keep text objects whose origin falls inside the region
var textInRegion = page.ObjectModel.GetTextObjects()
    .Where(t =>
        t.X >= regionX &&
        t.X <= regionX + regionWidth &&
        t.Y >= regionY &&
        t.Y <= regionY + regionHeight)
    .ToList();
Console.WriteLine($"Text in region: {string.Join(" ", textInRegion.Select(t => t.Text))}");
Useful for extracting data from forms with known layouts.
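When several regions are needed (header, totals box, signature line), the bounds check is easier to reuse as a small type. The following is a minimal sketch on plain data, assuming each text object can be reduced to its text and origin. TextRun, Region, and RegionQuery are hypothetical names introduced for illustration and are not IronPDF types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text object: content plus its origin in PDF units.
public record TextRun(string Text, double X, double Y);

// Axis-aligned rectangle in PDF coordinates (origin at the bottom-left of the page).
public record Region(double X, double Y, double Width, double Height)
{
    // True when the point lies inside the rectangle, edges included.
    public bool Contains(double px, double py) =>
        px >= X && px <= X + Width && py >= Y && py <= Y + Height;
}

public static class RegionQuery
{
    // Joins the text of runs whose origin lies inside the region,
    // ordered top-to-bottom (larger Y first), then left-to-right.
    public static string TextIn(Region region, IEnumerable<TextRun> runs) =>
        string.Join(" ", runs
            .Where(r => region.Contains(r.X, r.Y))
            .OrderByDescending(r => r.Y)
            .ThenBy(r => r.X)
            .Select(r => r.Text));
}
```

Sorting by descending Y before X gives reading order, because the PDF origin is at the bottom-left. Like the snippet above, this tests only each run's origin point, so text that starts just outside the region is excluded even if it extends into it.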
How Do I Work with Vector Paths?
Access drawing instructions:
using System;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("diagram.pdf");
var page = pdf.Pages[0];
var paths = page.ObjectModel.GetPathObjects();
// Print the drawing attributes of each path
foreach (var path in paths)
{
    Console.WriteLine($"Path with {path.Operations.Count} operations");
    Console.WriteLine($" Fill color: {path.FillColor}");
    Console.WriteLine($" Stroke color: {path.StrokeColor}");
    Console.WriteLine($" Stroke width: {path.StrokeWidth}");
}
Paths represent lines, rectangles, curves, and complex shapes.
How Do I Modify Object Properties?
Changing object attributes is experimental. The snippet below only locates a placeholder; it does not alter it:
using System;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("template.pdf");
var page = pdf.Pages[0];
// Find the text object to modify
var textObjects = page.ObjectModel.GetTextObjects();
var placeholder = textObjects.FirstOrDefault(t => t.Text == "{{NAME}}");
if (placeholder != null)
{
    // Note: Modification capabilities vary by PDF structure
    // Some changes may require regenerating the page
    Console.WriteLine($"Found placeholder at ({placeholder.X}, {placeholder.Y})");
}
// Nothing was changed above, so this writes an unmodified copy
pdf.SaveAs("modified.pdf");
Direct DOM modification is complex due to PDF's internal structure. For text replacement, use ReplaceText() methods instead.
What Are the Limitations?
PDF DOM access has constraints:
- Structure varies by generator - PDFs from different tools have different internal structures
- Text may be fragmented - A "sentence" might be multiple TextObjects
- Coordinates are in PDF units - Points (1/72 inch by default), with the origin at bottom-left, not top-left
- Modification is limited - Changing objects may break PDF structure
- Feature is experimental - API may change in future versions
For most use cases, higher-level methods (ExtractAllText(), ReplaceText()) are more reliable.
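Two of these constraints, fragmented text and the bottom-left origin, can be handled in a few lines once positions are in hand. The following is a minimal sketch on plain data, not an IronPDF API. Fragment and PdfLayout are hypothetical names, the 2-point tolerance is an arbitrary assumption, and joining fragments with a space assumes each fragment is a whole word.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one text fragment with its origin in PDF points.
public record Fragment(string Text, double X, double Y);

public static class PdfLayout
{
    // PDF measures Y upward from the bottom edge; screen-style layouts measure
    // downward from the top. pageHeight uses the same units (792 for US Letter).
    public static double ToTopLeftY(double pdfY, double pageHeight) => pageHeight - pdfY;

    // Groups fragments whose Y values are within `tolerance` points of the first
    // fragment of a line, reading top-to-bottom, then left-to-right.
    public static List<string> GroupIntoLines(IEnumerable<Fragment> fragments, double tolerance = 2.0)
    {
        var lines = new List<List<Fragment>>();
        foreach (var fragment in fragments.OrderByDescending(x => x.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= tolerance)
                current.Add(fragment);                      // same visual line
            else
                lines.Add(new List<Fragment> { fragment }); // start a new line
        }
        return lines
            .Select(line => string.Join(" ", line.OrderBy(x => x.X).Select(x => x.Text)))
            .ToList();
    }
}
```

Generators that split words into several fragments would need a gap check on X before inserting a space, which this sketch leaves out.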
When Should I Use DOM Access?
Good use cases:
- Analyzing document structure
- Extracting text with position data
- Finding images with their locations
- Building document inspection tools
- Understanding PDF content for debugging
Better alternatives:
- Simple text extraction → ExtractAllText()
- Find and replace → ReplaceText()
- Image extraction → ExtractAllImages()
- Adding content → Use stamps, watermarks, or HTML generation
Complete Example: Document Analyzer
using System.Collections.Generic;
using System.Linq;
using IronPdf;
// Install via NuGet: Install-Package IronPdf
public class PdfAnalyzer
{
    public DocumentAnalysis Analyze(string pdfPath)
    {
        // Dispose the document once the analysis is complete
        using var pdf = PdfDocument.FromFile(pdfPath);
        var analysis = new DocumentAnalysis
        {
            PageCount = pdf.PageCount,
            Pages = new List<PageAnalysis>()
        };
        foreach (var page in pdf.Pages)
        {
            var textObjects = page.ObjectModel.GetTextObjects().ToList();
            var imageObjects = page.ObjectModel.GetImageObjects().ToList();
            analysis.Pages.Add(new PageAnalysis
            {
                TextCount = textObjects.Count,
                ImageCount = imageObjects.Count,
                // Number of distinct font names on the page
                UniqueFont = textObjects.Select(t => t.FontName).Distinct().Count(),
                TotalTextLength = textObjects.Sum(t => t.Text?.Length ?? 0)
            });
        }
        return analysis;
    }
}
public class DocumentAnalysis
{
    public int PageCount { get; set; }
    public List<PageAnalysis> Pages { get; set; } = new List<PageAnalysis>();
}
public class PageAnalysis
{
    public int TextCount { get; set; }
    public int ImageCount { get; set; }
    public int UniqueFont { get; set; }
    public int TotalTextLength { get; set; }
}
PDF DOM access opens the document's internal structure. Use it for inspection and analysis, but prefer higher-level APIs for modification.
For more details on DOM access, see the IronPDF DOM documentation.
Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.