DEV Community: DroWzYOzIl

How to split PDF File in C# (Code Example Tutorial)

DroWzYOzIl — Wed, 09 Nov 2022 06:48:51 +0000

Are you looking for a way to split a single PDF document into multiple PDF files using the C# .NET programming language?

In this tutorial we will discuss how to split a single PDF file into multiple PDF documents. There are various advantages to splitting a PDF, and you should be able to do it while keeping the content of the document intact. One of the easiest ways to reduce the size of a PDF file without damaging it is to split it with IronPDF, a C# .NET library created specifically for creating and manipulating PDF files.

IronPDF

IronPDF is a robust HTML-to-PDF conversion library. It is an all-in-one solution for converting HTML pages to PDF in .NET Framework and .NET Core development, and it performs a number of other tasks besides HTML conversion. IronPDF enables developers to generate, modify, and read PDF files from .NET Framework and .NET Core projects, and to create or modify high-fidelity PDFs from HTML pages.

PDF files are lightweight and a convenient way of sending content over the internet. Creating such PDF files using IronPDF is easy, efficient, and time-saving, and it reduces the workload on developers by one third.

To get started with IronPDF, all you need is Visual Studio with C# support installed on your system. Then follow this step-by-step tutorial.

1. Creating a New Project in Visual Studio

Open Visual Studio.

Go to the File menu, choose "New Project", and then select Console Application.

In the relevant text box, type the project name and choose the path. Then click the Create button.

A new window will then appear. Select the required target framework and click Create.

The Program.cs file will open so you can enter the logic and create and run the application.

Now we can add the IronPDF library and test the program.

2. Install the IronPDF C# Library

The IronPDF library can be downloaded and installed in several different ways, but we will discuss only three of them:

Using the Visual Studio NuGet Package Manager

Using the Visual Studio Command-Line

Direct Download from the NuGet webpage

2.1. Using the Visual Studio NuGet Package Manager

The NuGet Package Manager in Visual Studio installs the package directly into the solution.

It has a search box that returns a list of the package libraries available on the NuGet website. Search for the keyword "IronPDF" in the package manager.

The search returns a list of related packages. Select the IronPDF option and install the package into the solution.

2.2. Using the Visual Studio Command-Line

In the Visual Studio menu, go to Tools, move the cursor to NuGet Package Manager, and click Package Manager Console.

The Package Manager Console will appear at the bottom of the screen. Type the following command and press Enter, and IronPDF will be installed in an instant.

Install-Package IronPDF

2.3. Direct Download from the NuGet webpage

The third way is to download the NuGet package directly from the webpage.

Navigate to the link "https://www.nuget.org/packages/IronPdf/"

Select the download package option from the menu on the right-hand side.

Double-click the downloaded package; it will be installed automatically.

Now reload the solution and begin using it in the project.

3. Splitting a PDF Document Using IronPDF

IronPDF can split one PDF document into two or more output PDF files. All you need is a source PDF file, either existing or newly created, which is then split by page number. Splitting turns a large PDF document into multiple smaller files, and the same approach can produce a set of single-page files.

In this tutorial we will discuss two different examples of splitting a three-page PDF document: into two files and into three files.

3.1. Splitting a PDF File into Two PDF Files

Here we will see how to split a single PDF document into two individual files using the C# programming language and the IronPDF library.

The following code sample shows how to load a PDF document and split its pages into multiple PDF files.



using IronPdf;

// Load the source PDF document
var pdf = new PdfDocument("url.pdf");
// Take the first page (page indexes are zero-based)
var pdf_page1 = pdf.CopyPage(0);
pdf_page1.SaveAs("Spli1.pdf");
// Take pages 2 and 3 (indexes 1 through 2)
var pdf_page2_3 = pdf.CopyPages(1, 2);
pdf_page2_3.SaveAs("Spli2.pdf");

OUTPUT File 1

The first output file contains only the first page of the three-page PDF document.

OUTPUT File 2

The second output file contains the remaining two pages of the original PDF document.

3.2. Splitting PDF Document into multiple Single Page Files

Here we will split a PDF document into three single-page PDF documents using IronPDF.



using IronPdf;

// Load the source PDF document
var pdf = new PdfDocument("url.pdf");

// Take the first page (page indexes are zero-based)
var pdf_page1 = pdf.CopyPage(0);
pdf_page1.SaveAs("Spli1.pdf");

// Take page 2
var pdf_page2 = pdf.CopyPage(1);
pdf_page2.SaveAs("Spli2.pdf");

// Take page 3
var pdf_page3 = pdf.CopyPage(2);
pdf_page3.SaveAs("Spli3.pdf");

OUTPUT File 1

OUTPUT File 2

OUTPUT File 3

4. Conclusion

In this tutorial we have discussed how to split a PDF document into multiple files using the IronPDF library. We showed how easy it is to split a PDF file into two or three separate PDF files based on its pages. IronPDF also keeps the output identical to the original PDF document and maintains the integrity of its links and formatting.
For more examples of IronPDF, please visit this link. To learn how IronPDF works and what features it offers, please visit this link.
If you buy the complete Iron Suite, you will get all five products for the price of two. For further details about licensing, please follow this link to purchase the complete package.

How to use AWS Textract in Python

DroWzYOzIl — Tue, 11 Oct 2022 06:47:49 +0000

1. Introduction

Amazon Textract is a machine learning service that extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to recognize, understand, and extract data from forms and tables. Today, many businesses either extract data manually from scanned documents such as PDFs, images, tables, and forms, or use basic OCR software that needs manual configuration (which often must be updated when the form changes). Textract replaces these time-consuming and expensive operations by using ML to read and process documents, extracting text, handwriting, tables, and other data without manual effort.

2. What is AWS Textract?

Put simply, AWS Textract is a deep learning-based service that automatically extracts text and data from document images and converts many types of documents into an editable format. Consider a situation where we have hard copies of invoices from several businesses and keep all their important data in Excel or other spreadsheets. Most of the time we rely on data entry workers to enter the data manually, which is chaotic, time-consuming, and prone to error. With Textract, however, all we have to do is upload our invoices, and it returns all the text, forms, key-value pairs, and tables in a better-organised form.

2.1. Strong and Normalized Data Capture

With the help of Amazon Textract, text and tabular data may be extracted from a range of documents, including financial records, scientific articles, and medical notes. The extraction of unstructured and structured data from your document will be much simpler thanks to these non-custom APIs, which continuously learn from a large quantity of data every day.

2.2. Building a smart search index

You can build text libraries from images and PDF files using Amazon Textract. Its text extraction returns text as words and lines, which makes it usable as input for natural language processing (NLP). If Amazon Textract document table analysis is turned on, the text is also organised by table cells. You have control over how text is organised when using Amazon Textract as an NLP input.

2.3. How Does Textract by Amazon (AWS) Work?

In this part we'll go over how AWS Textract operates. The underlying models are not open source, so we cannot examine their intricacies, but we know that powerful AI and ML algorithms are behind them. By summarising the available documentation, I'll attempt to unravel the workings.

The first thing that happens whenever a new or scanned document is sent to Textract is that it generates a list of block objects for all the identified text. For instance, if a bill contains 100 words, Textract creates a block object for each of those 100 words, along with blocks for the lines and the page that contain them. Each block contains details about the detected item, its location, and the level of confidence Amazon Textract has in the accuracy of the processing.
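As a rough illustration of the block list described above, the sketch below walks a hand-written, heavily trimmed response and collects the text and confidence of each block of a given type. The sample dictionary and the helper name summarize_blocks are invented for this example; the field names (Blocks, BlockType, Text, Confidence) follow the Textract response format.

```python
# Hypothetical, trimmed-down response; a real one also carries geometry and more fields.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Invoice 42", "Confidence": 99.1},
        {"BlockType": "WORD", "Id": "w1", "Text": "Invoice", "Confidence": 99.4},
        {"BlockType": "WORD", "Id": "w2", "Text": "42", "Confidence": 98.8},
    ]
}


def summarize_blocks(response, block_type="WORD"):
    """Return (text, confidence) pairs for every block of the given type."""
    return [
        (block["Text"], block["Confidence"])
        for block in response.get("Blocks", [])
        if block.get("BlockType") == block_type
    ]


# Print each detected word with the confidence reported for it.
for text, confidence in summarize_blocks(sample_response):
    print(f"{text}: {confidence:.1f}%")
```

Passing "LINE" instead of the default "WORD" returns whole lines, which is often enough when only the raw text is needed.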

Most documents typically consist of the following building blocks:

Lines and words of text on each page
Form data (key-value pairs)
Tables, table cells, and selection elements

By connecting Textract with other Amazon Web Services, you can automate the workflow of extracting, processing, and storing the relevant data.

2.4. OCR Textract

Businesses have typically employed staff such as data entry operators for simple document filling and database completion; their job is to read a document and extract all of the data it contains. Textract OCR automates this with a deep learning-based neural network architecture, although it cannot be fully customized or trained on a specific dataset. It does, however, adjust to your data and achieve improved accuracy if the extracted information is verified by a person (human in the loop). Textract outperforms Tesseract on tasks like table extraction and key-value pair extraction. However, it is confined to a few languages and document types.

Some of the document types that AWS Textract can process are listed below:

Regular Bills / Invoices
Financial Records
Medical Records
Documents written by hand
Paystubs or Personnel Records

With AWS Textract, there are three primary kinds of results we can obtain. The first is the extracted result as raw text. The second is the key-value pairs found in the document. The third is the table data. Amazon Textract also allows us to construct text libraries from image and PDF files.
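To make the third kind of result more concrete, here is a minimal sketch of how table data might be rebuilt from a response. It assumes the usual Textract layout in which a TABLE block lists CELL blocks as children and each CELL carries RowIndex and ColumnIndex; the sample response and the function names are invented for illustration.

```python
# Hypothetical, trimmed-down response describing one 2x2 table.
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Pen"},
    ]
}


def child_ids(block):
    """Collect the ids listed under a block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def extract_tables(response):
    """Return each table as a list of rows, each row a list of cell strings."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [blocks_by_id[i] for i in child_ids(cell)]
            text = " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = text
        if not cells:
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        # RowIndex and ColumnIndex are 1-based in the response.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables


print(extract_tables(sample_response))
```

Keying the cells by (row, column) keeps the sketch tolerant of cells arriving in any order and of empty cells that have no child words.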

AWS provides several ways to extract the text. Two of them are covered here:

AWS CLI
Python API

2.5. AWS CLI to extract data

A centralised tool for managing your AWS services is the AWS Command Line Interface (AWS CLI). You can use the command line and scripts to automate various AWS services with only one tool that you download and configure.

We can use the AWS OCR Textract service through the AWS Console, the AWS CLI, the Textract API, and programmatically through compatible client SDKs. In this lesson, you'll use the AWS CLI to extract content from images. The steps are as follows.

2.5.1. Step 1:

Use the commands below in your computer's terminal to create (mkdir) and change into (cd) a new directory. The directory can be given whatever name you like, but for the purposes of this demo it will be called textract-extraction. The new directory will hold the images whose text you extract with Textract.

mkdir textract-extraction && cd textract-extraction

2.5.2. Step 2:

Use the wget command below to download the Test.jpg public image. You will use the Textract service to extract the handwritten sentence from the image.

wget https://raw.Images.com/adam-the-automator/awsocr/Test.jpg

2.5.3. Step 3:

Execute the textract command below to extract the text from the Test.jpg image (detect-document-text) and output the data in JSON format (--output json). The --document flag takes a Bytes value that holds the base64 representation of the Test.jpg file.

aws textract detect-document-text --document Bytes=$( base64 ./Test.jpg ) --output json

Once the command completes, the JSON output is printed to the terminal. We can then extract data from it as required.

To find the next level of blocks detected in the document, take note of the values in a block's Ids array.
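As a sketch of how the Ids array can be followed, the snippet below parses a small, hand-written stand-in for the CLI output and walks from the page to its lines and then to their words. The JSON text, the ids, and the helper name child_ids are made up for illustration; a real response is far longer.

```python
import json

# Hypothetical, trimmed-down stand-in for the JSON printed by the CLI command.
cli_output = """
{
  "Blocks": [
    {"BlockType": "PAGE", "Id": "page-1",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
    {"BlockType": "LINE", "Id": "line-1", "Text": "Hello world",
     "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
    {"BlockType": "WORD", "Id": "word-1", "Text": "Hello"},
    {"BlockType": "WORD", "Id": "word-2", "Text": "world"}
  ]
}
"""


def child_ids(block):
    """Collect the ids listed under a block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


# Index every block by its id so relationships can be resolved quickly.
blocks_by_id = {block["Id"]: block for block in json.loads(cli_output)["Blocks"]}

page = blocks_by_id["page-1"]
lines = [blocks_by_id[i] for i in child_ids(page)]
words = [blocks_by_id[i]["Text"] for line in lines for i in child_ids(line)]

print([line["Text"] for line in lines])
print(words)
```

With real output, the same lookup can start from any block whose Relationships list contains a CHILD entry.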

2.6. Using Python API to extract data

We can use the Amazon Textract API from a variety of programming languages. In this section we'll examine a code block for key-value extraction using Python and Textract. Check out these docs for more details on language and API support.

This code snippet shows how to request key-value pairs from documents using the Python Textract API. To make this work, we'll also need to set up API keys in the AWS console.

We import every package required for sending documents to AWS and handling the text extraction.

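import json

import boto3

# Path to the document image; Test.jpg from the CLI steps is assumed here.
file_name = 'Test.jpg'

# Read the image as raw bytes to send to Textract.
with open(file_name, 'rb') as file:
    bytes_test = bytearray(file.read())
print('Image loaded', file_name)

# The client needs AWS credentials and a region to be configured.
client = boto3.client('textract')
# FeatureTypes=['FORMS'] asks Textract to return key-value pairs.
response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['FORMS'])
# Print the JSON result; default=str covers any non-JSON values in the metadata.
print(json.dumps(response, indent=2, default=str))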

Once the call completes, the JSON result is printed. We can extract data from it as required, for example by using the json package to pull specific information out of the returned result.
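One possible way to pull the key-value pairs out of such a result is sketched below. It runs on a hand-written, heavily trimmed response instead of a live call, and assumes the usual FORMS layout: KEY_VALUE_SET blocks whose EntityTypes mark them as KEY or VALUE, with a VALUE relationship linking a key to its value and CHILD relationships pointing at the words. The sample data and function names are invented for this example.

```python
# Hypothetical, trimmed-down FORMS response describing one pair: "Name:" -> "Jane Doe".
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
    ]
}


def related_ids(block, relationship_type):
    """Collect the ids a block points to through one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of this block."""
    children = [blocks_by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def extract_key_values(response):
    """Map each form key's text to the text of its linked value block(s)."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key = block_text(block, blocks_by_id)
        value = " ".join(block_text(blocks_by_id[i], blocks_by_id)
                         for i in related_ids(block, "VALUE"))
        pairs[key] = value
    return pairs


print(extract_key_values(sample_response))
```

The same function could be given the response returned by analyze_document; checkbox-style selection elements are skipped here because only WORD children are joined.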

3. IronOCR — Introduction and Features

IronOCR for .NET is a library from Iron Software that lets engineers read text content from images and PDFs in .NET apps and websites. It supports many different world languages, scans images for text and barcodes, and can output either plain text or structured data. MVC, web, console, and desktop .NET applications can all use Iron Software's OCR library. The development team directly assists with licensing for commercial deployments.

Using the latest Tesseract 5 engine, IronOCR scans text, barcodes, and QR codes from any picture or PDF file. This library allows desktop, console, and web applications to quickly incorporate OCR.

IronOCR supports 127 international languages. Word lists and custom languages are also supported.

More than 20 different barcode and QR code types can be scanned by IronOCR.

3.1. OCR Using IronOCR

The code below illustrates the Tesseract 5 API, which enables us to turn image files into text. First we create an IronTesseract object. We then create an OcrInput object, which lets us add one or more image files. When calling the OcrInput object's AddImage method, we need to specify the image's path in the code. You can add as many images as you like. Finally, we call the Read method of the IronTesseract object we created earlier; it parses the image files and returns an OCR result, from which the extracted text is available as a string.

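using System;
using IronOcr;

// Create the OCR engine and select the language and Tesseract version
var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.EnglishBest;
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;
using (var Input = new OcrInput())
{
    // Add one or more images to read
    Input.AddImage(@"3.png");
    var R = Ocr.Read(Input);
    // Print the recognised text
    Console.WriteLine(R.Text);
    Console.ReadKey();
}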

4. Conclusion

Compared to other providers, setting up Textract with another AWS service is simple. For instance, by adding an add-on, it is possible to save extracted document information using Amazon DynamoDB or S3. Amazon Textract follows the AWS shared responsibility model, which sets out rules and norms for data protection. However, AWS OCR is costly, and an internet connection is always needed for the OCR to work.

Tesseract is simple and quick to use through IronOCR in the .NET Framework environment. IronOCR offers numerous ways to read images and PDF files, and a number of parameters to tune the Tesseract OCR engine. It supports many languages, and multiple languages can be used simultaneously. Visit their website to learn more about the Tesseract OCR.

IronOCR outperforms AWS OCR. AWS OCR is expensive, has no development version, and requires a valid payment card to register an AWS account. IronOCR, on the other hand, is less expensive and offers a development edition. It also enables us to detect barcode data and read barcode values from pictures. The IronOCR bundle includes a lifetime license with no continuing fees, and a single purchase covers numerous systems. To learn more about IronOCR pricing, visit here.

C# Print PDF Directly to Printer (Code Example Tutorial)

DroWzYOzIl — Thu, 19 May 2022 07:15:33 +0000

Introduction

Are you looking to print PDF files programmatically on the C# .NET platform? You have come to the right place. In this article we will discuss how you can print PDF documents using C#.
First, let us look at how PDF printing works. Printers convert digital images and text into tangible copies. They achieve this by using a driver or other software to convert the file into a form that the printer can understand. A succession of tiny dots is then used to reproduce the picture or text on the page.
IronPDF is a .NET library that helps you print PDF documents directly from code. IronPDF also provides advanced printing features, such as locating and setting the printer name and changing the printer resolution.

IronPDF C# Library

In .NET and .NET Core development, IronPDF is an ideal tool for printing PDF files. It not only prints PDF files, but also has a lot of other useful functions. IronPDF allows developers to create, alter, and retrieve PDF documents from within .NET Core and .NET Framework projects, and to rapidly produce, alter, or print PDFs.
With Visual Basic or C# code, you can quickly print a PDF in .NET apps. This article will show you how to print programmatically using C# .NET PDF printing capabilities. IronPDF provides many programmatic printing functions for PDF files, such as setting the printer resolution and selecting a specific printer.

Creating a New Visual Studio project

Go to the File menu in Visual Studio. Choose "New Project", then "Console Application". In this article, we'll produce PDF documents using a console program.

In the relevant text box, type the project name and choose the path. Then press the Create button and select the required .NET framework.

Now that the new project has been created, go ahead and integrate the IronPDF library. The next section shows different ways to integrate IronPDF in Visual Studio.

Install IronPDF Library

There are several options for integrating IronPDF into your Visual Studio project; however, we will cover only two of them in this post.

1. Using the Visual Studio NuGet Package Manager

The NuGet Package Manager option in Visual Studio allows you to install the package directly into the solution. You can access the Package Manager from Solution Explorer.
It has a search box that displays a list of NuGet package libraries. Search for the keyword "IronPDF" in the package manager.
The search returns a list of relevant packages. Choose the IronPDF option and install the package.

2. Using the Visual Studio Command-Line

In the Visual Studio menu, go to Tools -> NuGet Package Manager -> Package Manager Console.
Enter the following line in the Package Manager Console tab:
Install-Package IronPDF
The package will now download and install into the current project and be ready to use.

Print PDF Files

When it comes to printing PDF files, IronPDF provides two options: first, sending a PDF document directly to a printer; second, creating a print document object that can be sent to GUI print dialogs. The code snippets below give a very quick demonstration of how to print PDF files using both options.

Printing PDF Files Directly (Silent Printing)



using IronPdf;

IronPdf.ChromePdfRenderer Renderer = new IronPdf.ChromePdfRenderer();

using PdfDocument Pdf = Renderer.RenderUrlAsPdf("https://www.nuget.org/packages/IronPdf");

Pdf.Print();

Using Object (print dialog)



using IronPdf;

IronPdf.ChromePdfRenderer Renderer = new IronPdf.ChromePdfRenderer();

using PdfDocument Pdf = Renderer.RenderUrlAsPdf("https://www.nuget.org/packages/IronPdf");

// Remember to add an assembly reference to System.Drawing.dll

// GetPrintDocument returns a PrintDocument that can be passed to a print dialog
System.Drawing.Printing.PrintDocument PrintDocYouCanWorkWith = Pdf.GetPrintDocument();

Advanced PDF Document Printing with IronPDF

IronPDF can handle complex printing capabilities such as identifying or establishing the printer name, changing the printer resolution, printing PDF to file, and tracing printing processes.

Specify Printer Name

If you have multiple printers available, IronPDF allows you to select a specific printer for the print job. If you do not specify a printer name, the default printer is used.



/**
Specify Printer Name
**/
using (var ChromePdfRenderer = new ChromePdfRenderer())
{
    using (var pdfDocument = ChromePdfRenderer.RenderHtmlAsPdf(TestSources.HtmlTemplateBasicText()))
    {
        using (var printDocument = pdfDocument.GetPrintDocument())
        {
            printDocument.PrinterSettings.PrinterName = "Microsoft Print to PDF";
            printDocument.Print();
        }
    }
}

Printer Resolution

Resolution refers to the number of pixels printed or shown, depending on the output. Using the DefaultPageSettings.PrinterResolution option of the print document returned by IronPDF, you can change the resolution of your printing. The code snippet below shows how to set the printer resolution with just a few lines of code.



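/**
Set Printer Resolution
**/
using (var ChromePdfRenderer = new ChromePdfRenderer())
{
    using (var pdfDocument = ChromePdfRenderer.RenderHtmlAsPdf(TestSources.HtmlTemplateBasicText()))
    {
        using (var printDocument = pdfDocument.GetPrintDocument())
        {
            // PrinterResolution comes from System.Drawing.Printing
            printDocument.DefaultPageSettings.PrinterResolution = new PrinterResolution
            {
                Kind = PrinterResolutionKind.Custom,
                X = 1200,
                Y = 1200
            };
            printDocument.Print();
        }
    }
}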

Printing PDF to File

The PdfDocument.PrintToFile function allows you to print a PDF to a file; simply provide the output file path and indicate whether or not you want to see a preview.



/**
Print to File
**/
// PrintToFile is a PdfDocument method, so it is called on the rendered pdfDocument
pdfDocument.PrintToFile("PathToFile", false);

Tracing Printing Processes

The benefit of using C# with IronPDF is that keeping track of printed pages, or anything printing-related, is actually extremely straightforward. In the following example, I will show how to trace the number of pages produced.



/**
Tracing Printing Processes
**/
using IronPdf;

IronPdf.ChromePdfRenderer Renderer = new IronPdf.ChromePdfRenderer();

using PdfDocument Pdf = Renderer.RenderUrlAsPdf("https://www.nuget.org/packages/IronPdf");

// Get a System.Drawing.Printing.PrintDocument for the PDF
using var printDocument = Pdf.GetPrintDocument();

// Count each page as it is sent to the printer
var printedPages = 0;
printDocument.PrintPage += (sender, args) => printedPages++;
printDocument.Print();

Conclusion

IronPDF is a perfect fit for printing PDF documents. It not only prints PDF files but also gives a lot of control over the printing process: printing multiple PDF files or PDF pages, printing specific pages, setting the printer resolution, printing a PDF document to a file, tracking the number of printed pages, and many other features. With the NuGet Package Manager Console you can easily integrate and manage IronPDF and print your PDF files. IronPDF provides a free version for developers and many features for manipulating PDF files as you see fit. For additional information on printing a PDF in C#, visit the following link.
You can download the software product from this link.