DEV Community: W Wolt

The State of OCR in .NET (2026): From Text Extraction to Real Pipelines

W Wolt — Fri, 10 Apr 2026 02:57:39 +0000

Introduction

I’ve integrated OCR into enough systems to know where it actually breaks.

Not in the demo.
Not in the first API call.

It breaks when:

documents are inconsistent
traffic increases
edge cases pile up

If you’re building anything in fintech, operations, or compliance-heavy workflows, OCR stops being a feature very quickly. It becomes part of your backend pipeline.

In 2026, the question is not how to extract text in C#. The question is whether your OCR setup can survive real input, real scale, and real business logic.

This article is based on that reality.

What OCR Looks Like in a Real System

In isolation, OCR looks like this:

var text = ocr.Read("document.png");

In production, it looks more like this:

var file = await storage.GetAsync(fileId);

var image = Preprocess(file);

var rawText = ocr.Read(image);

var structured = parser.Extract(rawText);

var validated = validator.Validate(structured);

await repository.SaveAsync(validated);

OCR is one step in a chain. If you treat it as a standalone feature, you will end up rewriting everything around it later.

Where Things Actually Break

After working with document pipelines in .NET services, the same problems show up every time.

Accuracy is tied to input quality, not the engine

Developers often compare OCR engines as if the engine alone decides accuracy.

It does not.

Take this:

var text = ocr.Read("invoice.jpg");

If that image is:

slightly rotated
low contrast
compressed

your results degrade fast.

You don’t fix this by switching libraries. You fix it with preprocessing.

Raw text is rarely useful

OCR gives you this:

Invoice Number: INV-2026-001
Total Amount: $1,245.00

Your system needs this:

```json
{
  "invoice_number": "INV-2026-001",
  "total": 1245.00
}
```

That gap is where most of the engineering effort goes.

Parsing, validation, error handling. OCR is just the input layer.
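
To make that gap concrete, here is a minimal sketch of a parser for the two-line sample above. The `Invoice` record, the `InvoiceParser` class, and the label patterns are assumptions made for illustration; they are not part of any OCR library, and a real parser would need to cover far more variation.

```csharp
#nullable enable
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical target shape for the structured result.
public record Invoice(string InvoiceNumber, decimal Total);

public static class InvoiceParser
{
    // Both patterns assume the exact labels used in the sample text.
    private static readonly Regex NumberPattern =
        new(@"Invoice Number:\s*(?<value>[A-Z0-9-]+)", RegexOptions.IgnoreCase);

    private static readonly Regex TotalPattern =
        new(@"Total Amount:\s*\$\s*(?<value>[0-9][0-9,]*(\.[0-9]{1,2})?)", RegexOptions.IgnoreCase);

    // Returns false instead of throwing, so the caller decides how to handle a miss.
    public static bool TryParse(string rawText, out Invoice? invoice)
    {
        invoice = null;

        var number = NumberPattern.Match(rawText);
        var total = TotalPattern.Match(rawText);
        if (!number.Success || !total.Success)
        {
            return false;
        }

        // NumberStyles.Number accepts thousands separators and a decimal point.
        if (!decimal.TryParse(total.Groups["value"].Value, NumberStyles.Number,
                CultureInfo.InvariantCulture, out var amount))
        {
            return false;
        }

        invoice = new Invoice(number.Groups["value"].Value, amount);
        return true;
    }
}
```

Calling `InvoiceParser.TryParse(rawText, out var invoice)` on the sample yields an invoice number of `INV-2026-001` and a decimal total of 1245.00. Any other label wording returns `false`, which is exactly the brittleness the rest of the pipeline has to absorb.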


Throughput becomes a problem faster than expected

In a microservice setup, you might start with something like:

```csharp
foreach (var file in batch)
{
    var text = ocr.Read(file);
    Process(text);
}
```

Now scale that across:

multiple pods
message queues
concurrent requests

You will hit:

CPU saturation
memory pressure
queue delays

OCR is expensive. Treat it like a heavy compute workload, not a simple utility.

Document variability kills assumptions

Even within the same domain, documents are inconsistent.

Two invoices:

different layouts
different labels
different formats

Your OCR pipeline must handle variation, not just extraction.

Hardcoded rules will work for a week. Then they break.
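
One way to absorb label variation without scattering special cases through the parser is an alias table that maps every known label variant to a canonical field name. The sketch below assumes simple `Label: value` lines; the aliases and field names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LabelNormalizer
{
    // Hypothetical alias table: label variants seen on documents -> canonical field name.
    private static readonly Dictionary<string, string> Aliases =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["Invoice Number"] = "invoice_number",
            ["Invoice No"] = "invoice_number",
            ["Invoice #"] = "invoice_number",
            ["Total"] = "total",
            ["Total Amount"] = "total",
            ["Amount Due"] = "total",
        };

    // Splits "Label: value" lines and keeps only the labels the alias table knows.
    public static Dictionary<string, string> Normalize(string rawText)
    {
        var fields = new Dictionary<string, string>();

        foreach (var line in rawText.Split('\n', StringSplitOptions.RemoveEmptyEntries))
        {
            var separator = line.IndexOf(':');
            if (separator <= 0)
            {
                continue; // not a "Label: value" line
            }

            // "Invoice No." and "Invoice No" should resolve to the same alias.
            var label = line[..separator].Trim().TrimEnd('.');
            var value = line[(separator + 1)..].Trim();

            if (Aliases.TryGetValue(label, out var field) && value.Length > 0)
            {
                fields[field] = value;
            }
        }

        return fields;
    }
}
```

A new layout then means adding rows to the table rather than editing parsing logic. It does not help with documents that have no labels at all, which is where layout-aware or model-based extraction comes in.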

The OCR Options Most .NET Developers Use

If you’ve been around .NET long enough, these are the usual paths.

Tesseract OCR

Still the default open source choice.

I’ve used it in multiple systems where cost and control mattered.

using Tesseract;

using var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default);
using var img = Pix.LoadFromFile("document.png");
using var page = engine.Process(img);

var text = page.GetText();

What you get:

full control
no API dependency
predictable cost

What you deal with:

tuning
preprocessing
inconsistent accuracy out of the box

It works, but you need to put effort into it.

Azure AI Vision OCR

If you want something that works fast with minimal setup, this is usually where teams go.

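// Assumes the Azure.AI.Vision.ImageAnalysis package and an ImageAnalysisClient named client.
var response = await client.AnalyzeAsync(BinaryData.FromStream(stream), VisualFeatures.Read);

foreach (var block in response.Value.Read.Blocks)
{
    foreach (var line in block.Lines)
    {
        Console.WriteLine(line.Text);
    }
}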

What you get:

strong accuracy
layout awareness
less setup

What you accept:

API latency
ongoing cost
data leaving your system

This is often the fastest way to production, but not always the best long-term fit.

Hybrid approach

This is what I see more teams doing now.

var text = localOcr.Read(file);

if (IsLowConfidence(text))
{
    text = await cloudOcr.ReadAsync(file);
}

You keep:

cost under control
latency manageable

And still handle:

edge cases with higher accuracy

This pattern scales better in real systems.
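
The snippet above leaves `IsLowConfidence` undefined. A minimal sketch of the fallback, assuming each engine can report a mean confidence between 0 and 1, might look like this. `OcrResult`, `IOcrEngine`, `DelegateOcr`, `FallbackOcr`, and the 0.80 threshold are all hypothetical names and values; with the Tesseract wrapper shown earlier, `page.GetMeanConfidence()` is one possible source for the score.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical result type: recognized text plus a mean confidence in [0, 1].
public record OcrResult(string Text, float Confidence);

// Hypothetical abstraction over a local engine and a cloud API.
public interface IOcrEngine
{
    Task<OcrResult> ReadAsync(byte[] file);
}

// Small adapter so any function can act as an engine (useful for wiring and tests).
public sealed class DelegateOcr : IOcrEngine
{
    private readonly Func<byte[], Task<OcrResult>> _read;

    public DelegateOcr(Func<byte[], Task<OcrResult>> read) => _read = read;

    public Task<OcrResult> ReadAsync(byte[] file) => _read(file);
}

public sealed class FallbackOcr : IOcrEngine
{
    private readonly IOcrEngine _local;
    private readonly IOcrEngine _cloud;
    private readonly float _minConfidence;

    public FallbackOcr(IOcrEngine local, IOcrEngine cloud, float minConfidence = 0.80f)
    {
        _local = local;
        _cloud = cloud;
        _minConfidence = minConfidence;
    }

    public async Task<OcrResult> ReadAsync(byte[] file)
    {
        var local = await _local.ReadAsync(file);

        // Empty output is treated as low confidence regardless of the score.
        var trusted = local.Confidence >= _minConfidence
                      && !string.IsNullOrWhiteSpace(local.Text);

        // Only pay for the cloud call when the local result is not trusted.
        return trusted ? local : await _cloud.ReadAsync(file);
    }
}
```

The threshold is worth tuning against your own documents: set it too low and bad local results slip through, too high and nearly everything goes to the cloud anyway.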

What Actually Matters When Choosing OCR

Forget feature lists. These are the decisions that matter.

Accuracy in your specific context

OCR accuracy is not universal.

Test with your documents:

scanned PDFs
mobile photos
compressed files

What works in a demo may fail in your pipeline.

Integration into your architecture

If you are running:

ASP.NET APIs
background workers
message queues

Then your OCR needs to:

handle concurrency
avoid blocking threads
fit into async workflows

Example:

await Task.Run(() => ocr.Read(file));

Even this can become a bottleneck if not managed properly.
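
One way to keep `Task.Run` from flooding the thread pool is to gate OCR calls behind a `SemaphoreSlim`, so only a fixed number run at once and the rest wait asynchronously. This is a sketch under that assumption; `ThrottledOcr` and the `read` delegate are hypothetical stand-ins for your own engine call.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class ThrottledOcr
{
    private readonly Func<byte[], string> _read; // the blocking OCR call
    private readonly SemaphoreSlim _gate;

    public ThrottledOcr(Func<byte[], string> read, int maxConcurrency)
    {
        _read = read;
        _gate = new SemaphoreSlim(maxConcurrency, maxConcurrency);
    }

    public async Task<string> ReadAsync(byte[] file, CancellationToken ct = default)
    {
        // Callers wait here asynchronously instead of piling work onto the thread pool.
        await _gate.WaitAsync(ct);
        try
        {
            // At most maxConcurrency OCR calls run at any moment.
            return await Task.Run(() => _read(file), ct);
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

A reasonable starting point for `maxConcurrency` is the number of cores the process is allowed to use, adjusted after measuring memory per OCR call.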

Deployment constraints

In containerized environments:

FROM mcr.microsoft.com/dotnet/aspnet:8.0

You need to think about:

CPU limits
memory limits
scaling behavior

Some OCR engines are not friendly in containers without tuning.

Data privacy requirements

If you are dealing with:

personal identity documents
financial records

Sending data to external APIs may not be acceptable.

This alone can eliminate certain options.

What Has Changed in 2026

OCR is now part of a broader document pipeline

The flow is no longer:

var text = ocr.Read(file);

It is:

var image = Preprocess(file);

var raw = ocr.Read(image);

var structured = parser.Parse(raw);

var enriched = await ai.Enrich(structured);

await Save(enriched);

OCR feeds into systems. It is not the end result.

AI is handling what used to be manual parsing

Instead of writing complex rules:

var total = Regex.Match(text, @"Total:\s+\$(\d+)").Groups[1].Value;

You now see:

var structured = await ai.Extract(text);

This reduces:

brittle parsing logic
maintenance overhead

But introduces:

dependency on model behavior
need for validation
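
A validation layer for model output can stay simple: check the shape and range of each field, then confirm that every extracted value actually occurs in the OCR text, which catches values the model made up. The sketch below assumes the invoice format from the earlier sample; `ExtractedInvoice`, `ExtractionValidator`, and the rules themselves are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical shape of what an extraction model returns.
public record ExtractedInvoice(string InvoiceNumber, decimal Total);

public static class ExtractionValidator
{
    // Assumed invoice number format, taken from the INV-2026-001 sample.
    private static readonly Regex NumberFormat = new(@"^INV-\d{4}-\d{3,}$");

    // Returns a list of problems; an empty list means the record passed.
    public static IReadOnlyList<string> Validate(ExtractedInvoice invoice, string rawText)
    {
        var errors = new List<string>();

        if (!NumberFormat.IsMatch(invoice.InvoiceNumber))
            errors.Add("invoice_number has an unexpected format");

        if (invoice.Total < 0)
            errors.Add("total is negative");

        // Grounding check: the value must be present in the OCR text.
        if (!rawText.Contains(invoice.InvoiceNumber, StringComparison.OrdinalIgnoreCase))
            errors.Add("invoice_number not found in source text");

        // Compare against the text with thousands separators removed.
        var withoutSeparators = rawText.Replace(",", "");
        var total = invoice.Total.ToString("0.00", CultureInfo.InvariantCulture);
        if (!withoutSeparators.Contains(total, StringComparison.Ordinal))
            errors.Add("total not found in source text");

        return errors;
    }
}
```

Records that fail can be routed to a retry, a second engine, or a human review queue instead of being saved.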

Preprocessing is no longer optional

Cleaning up the image first will usually improve results more than switching OCR engines:

var processed = image
    .ToGrayscale()
    .IncreaseContrast()
    .Deskew();

This is one of the most overlooked parts of OCR pipelines.
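
The method names in the snippet above read as placeholders rather than a specific library's API. As a rough sketch of what the first two steps do to pixel data, the following works directly on a raw RGB byte buffer. The `Preprocessing` class and the buffer layout are assumptions, and deskewing is left out because it needs angle estimation that is better handled by an imaging library.

```csharp
using System;

public static class Preprocessing
{
    // Converts interleaved 8-bit RGB (3 bytes per pixel) to 8-bit grayscale
    // using the Rec. 601 luma weights.
    public static byte[] ToGrayscale(byte[] rgb)
    {
        if (rgb.Length % 3 != 0)
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));

        var gray = new byte[rgb.Length / 3];
        for (var i = 0; i < gray.Length; i++)
        {
            var r = rgb[i * 3];
            var g = rgb[i * 3 + 1];
            var b = rgb[i * 3 + 2];
            gray[i] = (byte)Math.Round(0.299 * r + 0.587 * g + 0.114 * b);
        }
        return gray;
    }

    // Linear contrast stretch: maps the darkest pixel to 0 and the brightest to 255.
    public static byte[] StretchContrast(byte[] gray)
    {
        byte min = 255, max = 0;
        foreach (var p in gray)
        {
            if (p < min) min = p;
            if (p > max) max = p;
        }

        var result = new byte[gray.Length];
        if (max <= min)
        {
            // Flat or empty image: nothing to stretch.
            Array.Copy(gray, result, gray.Length);
            return result;
        }

        for (var i = 0; i < gray.Length; i++)
        {
            result[i] = (byte)((gray[i] - min) * 255 / (max - min));
        }
        return result;
    }
}
```

A faded scan whose pixels only span 100 to 200 comes out spanning the full 0 to 255 range, which gives the OCR engine a much clearer edge between ink and paper.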

Scaling OCR is now an architecture problem

You do not scale OCR by writing better code.

You scale it by:

queueing workloads
distributing processing
controlling concurrency

Typical pattern:

await queue.Publish(fileId);

Worker:

var fileId = await queue.Consume();
var file = await storage.GetAsync(fileId);
var result = ocr.Read(file);

This is where microservices and background processing matter.
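
The same shape can be sketched inside a single process with `System.Threading.Channels`, which is useful for seeing how queueing and controlled concurrency fit together before a real broker is involved. `OcrWorkerPool` and the `process` delegate are hypothetical; in production the channel would be replaced by your message queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class OcrWorkerPool
{
    private readonly Channel<string> _queue;
    private readonly Func<string, Task<string>> _process; // fileId -> extracted text
    private readonly Task[] _workers;

    public ConcurrentDictionary<string, string> Results { get; } = new();

    public OcrWorkerPool(Func<string, Task<string>> process, int workerCount, int capacity)
    {
        _process = process;

        // Bounded: when the queue is full, publishers wait instead of exhausting memory.
        _queue = Channel.CreateBounded<string>(new BoundedChannelOptions(capacity)
        {
            FullMode = BoundedChannelFullMode.Wait
        });

        // A fixed number of workers is the concurrency limit.
        _workers = new Task[workerCount];
        for (var i = 0; i < workerCount; i++)
        {
            _workers[i] = Task.Run(() => WorkAsync());
        }
    }

    public ValueTask PublishAsync(string fileId) => _queue.Writer.WriteAsync(fileId);

    // Signals that no more work is coming and waits for the workers to drain the queue.
    public Task CompleteAsync()
    {
        _queue.Writer.Complete();
        return Task.WhenAll(_workers);
    }

    private async Task WorkAsync()
    {
        await foreach (var fileId in _queue.Reader.ReadAllAsync())
        {
            Results[fileId] = await _process(fileId);
        }
    }
}
```

This sketch has no retry or error handling: an exception in `process` ends that worker and surfaces from `CompleteAsync`. A real worker would catch failures per message and send them to a dead-letter path.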

How I Approach OCR in .NET Projects

After enough iterations, this is the approach that holds up.

Start with real documents, not samples.
Build preprocessing early.
Treat OCR as a compute-heavy service.
Separate extraction from interpretation.
Add validation layers.

And most importantly, expect edge cases.

Where This Fits in the Bigger Picture

OCR is not the end of the pipeline.

It sits at the start.

Typical flow in modern systems:

OCR extracts data from documents
services process and validate it
PDFs present final outputs
Excel and Word handle structured workflows

If you get OCR wrong, everything downstream becomes harder.

Final Thoughts

OCR in .NET has matured, but the challenges have not disappeared.

You can extract text in minutes.
You will spend weeks making it reliable.

If you are choosing a .NET OCR approach in 2026, optimize for:

how it behaves with your real data
how it scales in your architecture
how it integrates with the rest of your pipeline

Everything else is secondary.

Best C# PDF Library in 2026? A Real-World .NET Comparison

W Wolt — Thu, 09 Apr 2026 06:28:48 +0000

Introduction

If you’ve been building in .NET long enough, you already know this pattern.

At some point, every system needs to generate a PDF.

Invoices. Reports. Contracts. Audit trails. Export features that started as “nice to have” become critical workflows.

And then someone searches “C# PDF library” or “.NET PDF library” and assumes the problem is solved.

It isn’t.

Generating a PDF in C# is easy. Getting that PDF to render correctly across environments, scale under load, and remain consistent over time is where things start to break.

After shipping multiple systems across ASP.NET APIs, containerized workloads, and document-heavy pipelines, I’ve found that the decision in 2026 is no longer about features. It’s about behavior in production.

This piece breaks down how PDF libraries actually perform in real systems, what trade-offs matter, and how to choose something that won’t slow you down later.

This also sets up the broader picture. PDF is just one part of the document stack. OCR, Excel, and Word pipelines sit right next to it, and they all connect.

We’ll get there in the next articles.

The Problems You Only See in Production

Most blog posts stop at “create PDF from HTML” examples.

That’s not where systems fail.

Layout breaks across environments

You render a PDF locally. Looks perfect.

Deploy to Linux containers. Suddenly:

fonts are missing
spacing shifts
page breaks cut content in half

Example scenario. You reuse a Razor view and convert it to PDF.

var html = await _viewRenderService.RenderToStringAsync("Invoice", model);
var pdf = renderer.RenderHtmlAsPdf(html);
pdf.SaveAs("invoice.pdf");

Looks clean. Works locally.

Then production hits:

different font fallback
different DPI handling
subtle CSS differences

Now your invoice layout is off by just enough to cause support tickets.

Performance collapses under load

Rendering one PDF is trivial.

Rendering thousands per hour in a microservice is not.

Typical pattern in a Web API:

[HttpPost("generate-report")]
public IActionResult GenerateReport([FromBody] ReportRequest request)
{
    var html = _reportService.BuildHtml(request);
    var pdf = _pdfRenderer.RenderHtmlAsPdf(html);

    return File(pdf.BinaryData, "application/pdf", "report.pdf");
}

This works fine until:

concurrent requests increase
memory spikes
CPU saturates

Now your PDF service becomes the bottleneck in your system.

Output inconsistency becomes a real issue

You deploy a new version of your service.

Same input. Different output.

Even small changes matter in:

financial reports
legal documents
audit logs

If your PDFs are not reproducible, you lose trust in your system.

The Libraries Everyone Evaluates

If you search “best C# PDF library”, you’ll see the same set of tools.

Each one solves a different version of the problem.

iText 7

This is the enterprise baseline.

It gives you deep control over PDF structure. You can manipulate content at a very low level.

Example:

using (var writer = new PdfWriter("output.pdf"))
using (var pdf = new PdfDocument(writer))
using (var document = new Document(pdf))
{
    document.Add(new Paragraph("Hello PDF"));
}

Where it fits:

complex PDF manipulation
stamping, merging, signing workflows

Trade-offs:

licensing is not trivial
more verbose API
steeper learning curve

PDFsharp

One of the most common open source options.

Example:

var document = new PdfDocument();
var page = document.AddPage();

var gfx = XGraphics.FromPdfPage(page);
gfx.DrawString("Hello PDF", new XFont("Arial", 20), XBrushes.Black, new XPoint(20, 50));

document.Save("output.pdf");

Where it fits:

simple document generation
low complexity use cases

Trade-offs:

limited modern capabilities
not suitable for HTML-to-PDF
struggles with scale

QuestPDF

This is a different approach. You define layout in code.

Example:

Document.Create(container =>
{
    container.Page(page =>
    {
        page.Content()
            .Text("Hello PDF")
            .FontSize(20);
    });
}).GeneratePdf("output.pdf");

Where it fits:

structured documents
predictable layouts

Trade-offs:

not HTML-based
requires writing layout logic in code
less flexible for dynamic content reuse

wkhtmltopdf (via wrappers)

Still widely used for HTML-to-PDF.

Example:

var converter = new BasicConverter(new PdfTools());
var doc = new HtmlToPdfDocument()
{
    Objects =
    {
        new ObjectSettings { HtmlContent = html }
    }
};

var pdf = converter.Convert(doc);

Where it fits:

quick HTML-to-PDF setups
legacy systems

Trade-offs:

outdated rendering engine
limited CSS support
inconsistent behavior in modern environments

IronPDF

This one addresses the container and consistency problems directly.

It uses a Chromium rendering engine, so HTML and CSS are laid out the same way on Windows and Linux. The fonts your templates use still have to be present in the container image, but with those in place you avoid most font fallback and DPI surprises.

Example:


var renderer = new ChromePdfRenderer();
var pdf = renderer.RenderHtmlAsPdf(html);
pdf.SaveAs("output.pdf");

Where it fits:

HTML-to-PDF pipelines
containerized deployments
high-volume rendering

Trade-offs:

commercial license
larger footprint than minimal libraries

The Chromium engine handles modern CSS and JavaScript, which solves the rendering inconsistency issue you hit with wkhtmltopdf. It also includes support for headers, footers, digital signatures, and merging without pulling in additional dependencies.

For teams already reusing Razor views or frontend templates, this removes most of the workaround code.

More details at https://ironpdf.com/

The Trade-Offs That Actually Matter

After working across multiple systems, the decision usually comes down to these.

HTML vs programmatic layout

If your content already exists as HTML:

reports from Razor
UI templates
email templates

Then HTML-to-PDF makes sense.

If your content is structured data:

tables
calculated fields
fixed layouts

Then programmatic libraries are more predictable.

Engineering time vs licensing

Open source looks cheaper.

Until:

you spend time fixing edge cases
debugging rendering issues
building workarounds

Paid libraries shift that cost into licensing.

In most real systems, engineering time is the bigger cost.

Throughput vs rendering accuracy

High fidelity rendering engines:

consume more memory
take longer to process

Faster engines:

simplify layout
sacrifice CSS support

You need to align this with your workload.

What Changed Around 2026

This is where older advice starts to break.

Everything runs in containers now

Most .NET apps are deployed via:

Docker
Kubernetes
CI/CD pipelines

If your PDF library requires:

manual OS setup
system-level dependencies

it introduces friction immediately.

You want something that behaves the same:

locally
in staging
in production

HTML reuse is the default

Teams are not building document layouts twice.

They reuse:

Razor views
frontend templates
shared design systems

Example:

var html = await _razorRenderer.RenderAsync("ReportTemplate", model);
var pdf = _pdfService.Render(html);

This reduces duplication but increases pressure on the rendering engine.

AI is increasing document volume

LLMs are generating:

summaries
reports
structured outputs

PDF becomes the final delivery format.

This increases:

volume of generation
importance of consistency
need for automation

How I Evaluate a PDF Library Now

After enough iterations, the evaluation becomes straightforward.

1. Can it survive my deployment model?

Test in:

Linux containers
production-like environments

If it behaves differently, it’s a red flag.

2. Does it handle concurrency?

Simulate load:

Parallel.For(0, 100, i =>
{
    var html = GenerateHtml(i);
    var pdf = renderer.RenderHtmlAsPdf(html);
});

Watch:

memory usage
CPU spikes
failures under pressure
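
To turn that loop into numbers you can compare between libraries, a small harness can count failures, time the run, and sample memory. This is a sketch: `RenderLoadTest`, `LoadResult`, and the `render` delegate are hypothetical, and the delegate stands in for whichever library call you are testing.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public record LoadResult(int Succeeded, int Failed, TimeSpan Elapsed, long PeakManagedBytes);

public static class RenderLoadTest
{
    // `render` takes a document index and returns the PDF bytes.
    public static LoadResult Run(Func<int, byte[]> render, int documents, int maxParallelism)
    {
        int succeeded = 0, failed = 0;
        long peak = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };
        var clock = Stopwatch.StartNew();

        Parallel.For(0, documents, options, i =>
        {
            try
            {
                var pdf = render(i);
                if (pdf is { Length: > 0 }) Interlocked.Increment(ref succeeded);
                else Interlocked.Increment(ref failed);
            }
            catch (Exception)
            {
                // Count the failure and keep going so one bad document does not end the run.
                Interlocked.Increment(ref failed);
            }

            // Managed heap only; native memory used by a rendering engine is not included.
            var now = GC.GetTotalMemory(forceFullCollection: false);
            long seen;
            while (now > (seen = Interlocked.Read(ref peak)) &&
                   Interlocked.CompareExchange(ref peak, now, seen) != seen)
            {
                // Another thread updated the peak first; retry with the new value.
            }
        });

        return new LoadResult(succeeded, failed, clock.Elapsed, peak);
    }
}
```

Because engines built on a browser do most of their work in native memory, pair this with container-level metrics before trusting the memory figure.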

3. Is the output consistent?

Generate the same document:

across environments
across deployments

Compare results.

If they differ, it will become a problem later.
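
Comparing raw bytes rarely works, because a PDF normally carries creation and modification timestamps and a file identifier that change on every run. A rough sketch of a more stable comparison is to strip those entries before hashing. `PdfFingerprint` is a hypothetical helper, and it only works when those entries are stored uncompressed; other sources of run-to-run noise, such as dates inside XMP metadata, are not handled.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfFingerprint
{
    // Entries that normally change on every run: timestamps in the Info dictionary
    // and the file identifier in the trailer.
    private static readonly Regex Volatile = new(
        @"/(CreationDate|ModDate)\s*\([^)]*\)|/ID\s*\[\s*<[0-9A-Fa-f]*>\s*<[0-9A-Fa-f]*>\s*\]",
        RegexOptions.Compiled);

    public static string Compute(byte[] pdf)
    {
        // Latin-1 maps every byte to exactly one char, so binary streams survive the round trip.
        var latin1 = Encoding.Latin1;
        var text = latin1.GetString(pdf);
        var stable = Volatile.Replace(text, string.Empty);

        var hash = SHA256.HashData(latin1.GetBytes(stable));
        return Convert.ToHexString(hash);
    }
}
```

If two builds produce the same fingerprint for the same input, the output is byte-stable apart from metadata. If they differ, rendering both files to images and comparing those shows whether the difference is visible.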

4. How much workaround code do I need?

The more patches you write:

the harder it becomes to maintain
the more fragile your system gets

Where This Is Going

PDF is no longer a utility feature.

It sits in the middle of:

user-facing workflows
compliance requirements
backend automation

And it doesn’t exist alone.

In most real systems:

OCR feeds data into PDFs
Excel exports complement reports
Word documents handle editable workflows

This is becoming a document pipeline, not isolated features.

What We’ll Cover Next

This article focused on PDF generation.

Next in the series:

OCR in .NET and how extraction pipelines actually work
Excel libraries and large dataset handling
Word generation and template-driven workflows

These pieces connect more than most teams expect.

Final Thoughts

After 15 years working with C#, APIs, and distributed systems, this is the pattern I keep seeing.

The problem is never generating the document.

The problem is everything around it:

consistency
scale
reliability

If you’re choosing a C# PDF library in 2026, optimize for:

predictable behavior in production
performance under load
minimal operational friction

Because once this sits in your pipeline, replacing it later is not trivial.