DEV Community

IronSoftware

Sanitize PDF in C# (.NET 10 Guide)

PDF files are everywhere in enterprise software, but they're also one of the most common attack vectors. I learned this the hard way when a client's document upload system became the entry point for a malware payload embedded in a seemingly innocent PDF invoice.

That incident pushed me to implement proper PDF sanitization in C#. Here's what worked.

Why Do PDFs Need Sanitizing?

PDFs can contain JavaScript, embedded files, forms with active content, and metadata that leaks sensitive information. These features make PDFs functional but also dangerous.

Common PDF-based attacks include:

  • JavaScript execution that triggers malware downloads
  • Embedded files hiding ransomware payloads
  • Form fields collecting user data maliciously
  • Metadata exposing document history and author details
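
To get a rough feel for which of these features an upload declares, one option is to look for well-known PDF name tokens in the raw bytes. The sketch below is an assumption-laden triage helper, not part of IronPDF: `CountRiskMarkers` is a hypothetical name, and a plain byte scan misses anything hidden in compressed object streams or written with escaped names. It is only a quick look before sanitizing, not a substitute for it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not an IronPDF API): counts well-known PDF name
// tokens in the raw bytes. Latin1 maps every byte to exactly one char,
// so binary content survives the conversion to a string.
static Dictionary<string, int> CountRiskMarkers(byte[] pdfBytes)
{
    string[] markers = { "/JavaScript", "/JS", "/OpenAction", "/Launch", "/EmbeddedFile", "/AcroForm" };
    string text = Encoding.Latin1.GetString(pdfBytes);
    var counts = new Dictionary<string, int>();

    foreach (string marker in markers)
    {
        int count = 0;
        int index = text.IndexOf(marker, StringComparison.Ordinal);
        while (index >= 0)
        {
            // Ignore hits that are only the prefix of a longer name,
            // e.g. "/JS" inside "/JSX".
            int end = index + marker.Length;
            bool isWholeName = end >= text.Length || !char.IsLetterOrDigit(text[end]);
            if (isWholeName) count++;
            index = text.IndexOf(marker, end, StringComparison.Ordinal);
        }
        counts[marker] = count;
    }
    return counts;
}

// Usage with an in-memory sample instead of a real upload.
byte[] sample = Encoding.Latin1.GetBytes(
    "%PDF-1.7\n1 0 obj << /OpenAction 2 0 R >>\n2 0 obj << /S /JavaScript /JS (app.alert(1)) >>");

foreach (var (marker, count) in CountRiskMarkers(sample))
{
    if (count > 0) Console.WriteLine($"{marker}: {count}");
}
```

A non-zero count does not prove the file is malicious, and a zero count does not prove it is safe. It just tells you what the file openly declares.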

If your application accepts PDF uploads from users, you need sanitization.

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var uploaded = PdfDocument.FromFile("user-upload.pdf");
var clean = Cleaner.SanitizeWithSvg(uploaded);
clean.SaveAs("safe-output.pdf");

How Does PDF Sanitization Work?

The most reliable sanitization technique converts the PDF to a static image format, then rebuilds it as a clean PDF. This strips out all active content while preserving the visual appearance.

IronPDF offers two approaches:

SVG-based sanitization (faster, maintains text searchability):

using IronPdf;

var pdf = PdfDocument.FromFile("input.pdf");
var sanitized = Cleaner.SanitizeWithSvg(pdf);
sanitized.SaveAs("output.pdf");

Bitmap-based sanitization (slower, more consistent layout):

using IronPdf;

var pdf = PdfDocument.FromFile("input.pdf");
var sanitized = Cleaner.SanitizeWithBitmap(pdf);
sanitized.SaveAs("output.pdf");

I typically use SVG because speed matters in production systems, and searchable text is valuable for downstream processing.

What's the Difference Between SVG and Bitmap Sanitization?

SVG sanitization converts each page to scalable vector graphics. It's fast and produces smaller files with selectable text. However, complex layouts or unusual fonts might not render identically.

Bitmap sanitization renders pages as raster images at a specified DPI. It's pixel-perfect but creates larger files, and all text becomes an image (not searchable unless you run OCR afterward).

For most business documents like invoices, contracts, and reports, SVG works great. For forms with precise alignment requirements, I use Bitmap.

How Do I Scan PDFs for Threats Before Sanitizing?

IronPDF includes a ScanPdf method that checks for known malicious patterns using YARA rules:

using IronPdf;

var pdf = PdfDocument.FromFile("suspicious.pdf");
var scan = Cleaner.ScanPdf(pdf);

if (scan.IsDetected)
{
    Console.WriteLine($"Threats found: {scan.Risks.Count}");
    foreach (var risk in scan.Risks)
    {
        // Print each reported risk entry as-is
        Console.WriteLine($"- {risk}");
    }
}

YARA rules identify suspicious patterns like embedded JavaScript, shell commands, or obfuscated content. The default ruleset covers common attack vectors, but you can provide a custom YARA file if your organization has specific security requirements:

var scan = Cleaner.ScanPdf(pdf, "custom-rules.yara");

This two-step approach (scan first, sanitize second) gives you visibility into what threats existed before cleaning.

Can I Customize the Sanitized PDF Layout?

Yes, through ChromePdfRenderOptions. This is helpful when sanitization changes margins or page size in ways that break your downstream processes.

using IronPdf;
using IronPdf.Rendering; // PdfPaperSize lives here

var options = new ChromePdfRenderOptions
{
    MarginTop = 20,
    MarginBottom = 20,
    MarginLeft = 15,
    MarginRight = 15,
    PaperSize = PdfPaperSize.A4
};

var pdf = PdfDocument.FromFile("input.pdf");
var sanitized = Cleaner.SanitizeWithSvg(pdf, options);
sanitized.SaveAs("output.pdf");

I've used this when integrating with legacy systems that expected PDFs with exact margin specifications. Without the render options, the sanitized PDFs failed validation in the downstream workflow.

What About Alternatives to IronPDF?

I've tried several approaches:

Ghostscript requires command-line execution and doesn't integrate cleanly into .NET applications. You're spawning processes and parsing output, which is fragile in production.

iTextSharp can manipulate PDFs but doesn't have built-in sanitization. You'd need to manually strip JavaScript, flatten forms, and remove embedded files, which took me over 200 lines of code and still missed edge cases.

Puppeteer (via PuppeteerSharp) can render PDFs in a headless browser, but that's overkill for sanitization and introduces another heavy dependency.

IronPDF solved this in three lines of code with better performance than the alternatives I tested.

How Do I Handle Sanitization Errors?

Not all PDFs sanitize cleanly. Corrupted files, password-protected documents, and PDFs with non-standard encodings can fail.

using IronPdf;

try
{
    var pdf = PdfDocument.FromFile("input.pdf");
    var sanitized = Cleaner.SanitizeWithSvg(pdf);
    sanitized.SaveAs("output.pdf");
}
catch (Exception ex)
{
    Console.WriteLine($"Sanitization failed: {ex.Message}");
    // Log the error, reject the upload, or fall back to bitmap sanitization
}

In my implementation, I log failures to Application Insights and send an alert if the failure rate exceeds 5% in a rolling 24-hour window. Most failures are malformed uploads, not library issues.
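
The rolling-window check described above can be sketched in plain .NET. This is a minimal in-memory illustration, not the Application Insights setup itself: `RecordAndCheck`, the `outcomes` queue, and the `minSamples` guard are all hypothetical names and choices, and it assumes timestamps arrive in increasing order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory store of recent sanitization outcomes.
var outcomes = new Queue<(DateTimeOffset At, bool Failed)>();

// Records one outcome and returns true when the failure rate inside the
// rolling window exceeds the threshold. minSamples is an added assumption
// so that one failure out of one attempt does not raise an alert.
bool RecordAndCheck(DateTimeOffset now, bool failed, TimeSpan window,
                    double threshold, int minSamples = 20)
{
    outcomes.Enqueue((now, failed));

    // Drop entries that have aged out of the window.
    while (outcomes.Count > 0 && now - outcomes.Peek().At > window)
    {
        outcomes.Dequeue();
    }

    int failures = 0;
    foreach (var entry in outcomes)
    {
        if (entry.Failed) failures++;
    }

    double rate = (double)failures / outcomes.Count;
    return outcomes.Count >= minSamples && rate > threshold;
}

// Usage: a 24-hour window with a 5% threshold, as in the text.
var now = DateTimeOffset.UtcNow;
bool alert = RecordAndCheck(now, true, TimeSpan.FromHours(24), 0.05);
Console.WriteLine($"Alert: {alert}");
```

In a multi-instance deployment the counts would have to live in shared storage or in the monitoring system's own query layer; an in-process queue only sees the traffic of one instance.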

What's the Performance Impact of Sanitization?

For typical business documents (5-20 pages), SVG sanitization takes 200-500ms on a standard Azure App Service instance. Bitmap sanitization is 2-3x slower depending on the DPI setting.
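
Numbers like these depend heavily on hardware and document mix, so it is worth measuring on your own instance. Here is a minimal timing sketch using `Stopwatch`; `MedianDuration` is a hypothetical helper, and the `Thread.Sleep` call is only a stand-in for a real sanitization call on a representative file.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Runs the action several times and returns the median wall-clock time
// (the upper middle value when the number of runs is even).
static TimeSpan MedianDuration(Action action, int runs = 5)
{
    if (runs <= 0) throw new ArgumentOutOfRangeException(nameof(runs));

    var samples = new TimeSpan[runs];
    for (int i = 0; i < runs; i++)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        samples[i] = stopwatch.Elapsed;
    }

    Array.Sort(samples);
    return samples[runs / 2];
}

// Stand-in workload; replace the lambda body with the sanitization call.
TimeSpan median = MedianDuration(() => Thread.Sleep(20));
Console.WriteLine($"Median: {median.TotalMilliseconds:F0} ms");
```

Using the median rather than the mean keeps a single cold-start run from skewing the result.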

I built a batch processor that sanitizes uploaded PDFs asynchronously using Azure Functions with a consumption plan. Each function invocation handles one PDF, so the system scales automatically under load.

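// Azure Function example (in-process model)
using System.IO;
using System.Threading.Tasks;
using IronPdf;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

[FunctionName("SanitizePdf")]
public static async Task Run(
    [BlobTrigger("uploads/{name}")] Stream input,
    string name,
    // Output blob streams must be bound with write access
    [Blob("sanitized/{name}", FileAccess.Write)] Stream output,
    ILogger log)
{
    var pdf = PdfDocument.FromStream(input);
    var sanitized = Cleaner.SanitizeWithSvg(pdf);

    // Await the copy so the async method does real asynchronous I/O
    await sanitized.Stream.CopyToAsync(output);

    log.LogInformation("Sanitized {Name}", name);
}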

This architecture handles thousands of PDFs per hour without manual scaling.

Does Sanitization Remove Metadata?

Yes, converting to an image and back strips most metadata. You lose author names, creation dates, software used, and document history.

If you need to preserve specific metadata fields, extract them before sanitization and re-apply afterward:

using IronPdf;

var pdf = PdfDocument.FromFile("input.pdf");

// Extract metadata before sanitization
var title = pdf.MetaData.Title;
var author = pdf.MetaData.Author;

// Sanitize
var sanitized = Cleaner.SanitizeWithSvg(pdf);

// Re-apply safe metadata
sanitized.MetaData.Title = title;
sanitized.MetaData.Author = "Verified User"; // Don't trust original author field

sanitized.SaveAs("output.pdf");

I recommend setting metadata to safe defaults rather than trusting the original values, which could contain malicious payloads.
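
If a field such as the title does have to be carried over, one way to limit the risk is to clean the value before re-applying it. The helper below is a sketch under my own assumptions: `SafeMetadataValue` is a hypothetical name, and the allowed character set and the 200-character cap are arbitrary choices to adapt.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns a cleaned copy of an untrusted metadata
// value, or the fallback when nothing usable remains.
static string SafeMetadataValue(string? raw, string fallback, int maxLength = 200)
{
    if (string.IsNullOrWhiteSpace(raw)) return fallback;

    // Keep letters, digits, spaces, and a small set of punctuation marks.
    string cleaned = new string(raw
        .Where(c => char.IsLetterOrDigit(c) || " .,_-()".Contains(c))
        .ToArray())
        .Trim();

    // Cap the length, then trim again in case the cut landed on a space.
    if (cleaned.Length > maxLength) cleaned = cleaned[..maxLength].TrimEnd();

    return cleaned.Length == 0 ? fallback : cleaned;
}

// Usage with a value that carries markup characters.
Console.WriteLine(SafeMetadataValue("Q3 <b>Invoice</b>", "Untitled"));
```

The cleaned string would then be the value assigned to a metadata field on the sanitized document, in place of the raw original.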

What About PDFs with Forms?

Sanitization flattens form fields, converting them to static text. This is intentional since form fields can execute JavaScript on user interaction.

If your application needs interactive forms, sanitize first, then rebuild the form programmatically using trusted field definitions:

var pdf = PdfDocument.FromFile("form.pdf");
var sanitized = Cleaner.SanitizeWithSvg(pdf);

// Rebuild safe form fields
// (IronPDF form creation code here)

Most of my use cases don't require interactive forms post-sanitization since we're archiving submitted documents, not collecting new data.

Should I Sanitize Every PDF?

It depends on the source. For user uploads, always sanitize. For PDFs your system generates internally, sanitization adds latency without security benefit.

I built a routing layer that checks the PDF source:

public void ProcessPdf(Stream pdfStream, PdfSource source)
{
    if (source == PdfSource.UserUpload || source == PdfSource.Email)
    {
        var pdf = PdfDocument.FromStream(pdfStream);
        pdf = Cleaner.SanitizeWithSvg(pdf);
        StoreSafePdf(pdf);
    }
    else
    {
        StoreTrustedPdf(pdfStream);
    }
}

This keeps the sanitization overhead limited to untrusted sources.
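
One design point worth considering in a router like this is what happens when a new source is added later. The sketch below is a variation rather than the code above: it allow-lists the trusted source and sanitizes everything else, so an unknown value is treated as untrusted. `UserUpload` and `Email` come from the snippet above; `InternalReport` and `RequiresSanitization` are hypothetical names.

```csharp
using System;

// Allow-list approach: only sources known to be trusted skip sanitization,
// so any value added to the enum later is sanitized by default.
static bool RequiresSanitization(PdfSource source) =>
    source != PdfSource.InternalReport;

foreach (PdfSource source in Enum.GetValues<PdfSource>())
{
    Console.WriteLine($"{source}: sanitize = {RequiresSanitization(source)}");
}

// InternalReport is a hypothetical name for a PDF the system generated itself.
enum PdfSource { UserUpload, Email, InternalReport }
```

The trade-off is a little extra latency for any source you forget to allow-list, which is usually cheaper than accidentally trusting one.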


Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.
