DEV Community

IronSoftware
How to Create UTF-8 Unicode PDFs in C# (.NET Guide)

UTF-8 encoding enables PDFs to display international characters — Chinese, Japanese, Arabic, Hebrew, Russian, Thai, and other non-Latin scripts. Without proper UTF-8 handling, these characters appear as question marks, boxes, or missing entirely. I've debugged invoice systems where customer names in Japanese rendered as "????" in PDFs despite displaying correctly on web pages.

The problem is that PDF generation libraries handle character encoding differently. Some assume Latin-1 (ISO-8859-1) encoding by default, replacing characters outside that range with question marks or dropping them entirely. Others require explicit font configuration for non-Latin scripts. Many Stack Overflow answers still recommend manual BaseFont setup with IDENTITY_H encoding, advice from 2010 that's unnecessarily complex with modern libraries.
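The Latin-1 failure mode is easy to reproduce with .NET's own encoding APIs. This sketch (plain .NET, no PDF library involved; the customer name is a made-up example) shows how CJK text degrades into the familiar question marks:

```csharp
using System;
using System.Text;

// Round-trip a string through Latin-1 (ISO-8859-1), the encoding some
// older PDF libraries assume by default. Characters outside the Latin-1
// range are replaced with '?', which is exactly the "????" symptom
// seen in generated PDFs.
string original = "Invoice for 山田太郎"; // hypothetical Japanese customer name
byte[] latin1Bytes = Encoding.Latin1.GetBytes(original);
string corrupted = Encoding.Latin1.GetString(latin1Bytes);

Console.WriteLine(corrupted); // the Japanese characters come back as '?'
```

The corruption happens silently: no exception is thrown, the bytes simply lose information.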

IronPDF handles UTF-8 automatically when HTML includes the proper charset declaration. If your HTML specifies <meta charset="UTF-8">, characters encode correctly without additional configuration. This eliminates the font configuration gymnastics required by older libraries where you manually loaded fonts, specified encodings, and handled fallbacks.

Understanding character encoding prevents common internationalization failures. UTF-8 represents characters using variable-length byte sequences — 1 byte for ASCII (English), 2-4 bytes for other languages. PDF internally uses various encodings depending on fonts and content. The bridge between HTML's UTF-8 and PDF's internal encoding happens during rendering. If the rendering engine doesn't support UTF-8 properly, characters get corrupted.
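You can observe the variable-length behavior directly with `Encoding.UTF8`; this small sketch (standard .NET, nothing PDF-specific) prints the byte count for one character from each class:

```csharp
using System;
using System.Text;

// UTF-8 is variable-length: 1 byte for ASCII, 2 for most Latin-extended,
// Cyrillic, Arabic, and Hebrew letters, 3 for CJK, and 4 for emoji and
// other supplementary-plane characters.
Console.WriteLine(Encoding.UTF8.GetByteCount("A"));   // 1 byte  (ASCII)
Console.WriteLine(Encoding.UTF8.GetByteCount("Й"));   // 2 bytes (Cyrillic)
Console.WriteLine(Encoding.UTF8.GetByteCount("你"));  // 3 bytes (CJK)
Console.WriteLine(Encoding.UTF8.GetByteCount("👋"));  // 4 bytes (emoji)
```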

I've migrated document systems serving global customers from libraries requiring manual font embedding to IronPDF's automatic handling. Customer names in Arabic, product descriptions in Chinese, addresses in Russian — all render correctly without font configuration code. The HTML charset declaration is sufficient.

```csharp
using IronPdf;
// Install via NuGet: Install-Package IronPdf
// ChromePdfRenderer docs: https://ironpdf.com/blog/videos/how-to-render-html-string-to-pdf-in-csharp-ironpdf/

var renderer = new ChromePdfRenderer();

var html = @"
<!DOCTYPE html>
<html>
<head>
    <meta charset=""UTF-8"">
    <style>
        body { font-family: Arial, 'Noto Sans', sans-serif; }
    </style>
</head>
<body>
    <h1>International Characters</h1>
    <p>Chinese: 你好世界</p>
    <p>Japanese: こんにちは世界</p>
    <p>Arabic: مرحبا بالعالم</p>
    <p>Hebrew: שלום עולם</p>
    <p>Russian: Привет мир</p>
    <p>Thai: สวัสดีชาวโลก</p>
</body>
</html>
";

var pdf = renderer.RenderHtmlAsPdf(html);
pdf.SaveAs("international.pdf");
```

The <meta charset="UTF-8"> declaration tells the rendering engine to expect UTF-8 encoded content. The font-family includes fallback fonts (Noto Sans) that support wide character ranges. This combination ensures international characters render correctly.
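If HTML arrives from upstream systems that may omit the declaration, a small guard can inject it before rendering. This is a hypothetical helper, not an IronPDF API; the function name and injection strategy are my own:

```csharp
using System;

// Hypothetical helper: ensure an HTML document declares UTF-8 before it
// is handed to the PDF renderer. If no charset declaration is present,
// inject one immediately after the opening <head> tag.
static string EnsureUtf8Charset(string html)
{
    if (html.Contains("charset", StringComparison.OrdinalIgnoreCase))
        return html;
    return html.Replace("<head>", "<head><meta charset=\"UTF-8\">",
        StringComparison.OrdinalIgnoreCase);
}

Console.WriteLine(EnsureUtf8Charset("<html><head></head><body>Привет</body></html>"));
```

A guard like this is cheap insurance when templates come from multiple teams with inconsistent boilerplate.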

Why Do PDFs Need Special UTF-8 Handling?

PDFs don't natively store text as UTF-8. Internally, PDFs use various text encoding schemes depending on fonts and content. Simple ASCII text might use WinAnsiEncoding. Complex Unicode text uses CID (Character Identifier) fonts with Identity-H or Identity-V encoding mappings.

When converting HTML to PDF, the library must:

  1. Parse UTF-8 encoded HTML correctly
  2. Map Unicode characters to font glyphs
  3. Embed fonts supporting those characters
  4. Configure PDF encoding schemes appropriately

If any step fails, characters disappear or display incorrectly. Early PDF libraries handled only Latin-1, limiting content to Western European languages. Modern libraries support Unicode fully, but configuration varies.

Stack Overflow has dozens of questions about UTF-8 in PDFs with answers recommending iTextSharp's BaseFont.CreateFont with IDENTITY_H encoding. These answers work but require manual font file loading and encoding specification:

```csharp
// Old iTextSharp approach (complex): manually load a font file and
// specify IDENTITY_H (horizontal CID) encoding with font embedding
BaseFont bf = BaseFont.CreateFont("c:\\windows\\fonts\\msyh.ttc,0",
    BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
```

This manually loads Microsoft YaHei font for Chinese characters. You specify IDENTITY_H encoding (horizontal CID mapping) and font embedding. For multiple languages, you manage multiple fonts and detect which characters need which fonts.

IronPDF's Chromium engine handles this automatically. Specify UTF-8 in HTML, use web-safe fonts with Unicode support, and the engine manages encoding, font selection, and embedding internally.

How Do I Ensure Fonts Support International Characters?

Not all fonts include all Unicode characters. Arial includes Latin, Greek, Cyrillic, Arabic, and Hebrew but lacks Chinese/Japanese/Korean (CJK). "Noto Sans" (Google's font) supports most Unicode blocks. Specifying appropriate fonts in CSS ensures characters render.

Use font stacks with Unicode fallbacks:

```css
body {
    font-family: Arial, 'Noto Sans', 'Segoe UI', sans-serif;
}
```

The browser tries Arial first. If characters aren't in Arial (like Chinese), it falls back to Noto Sans. This cascading ensures coverage across languages without manually detecting character sets.

For CJK languages specifically, use CJK-specific fonts:

```css
.chinese { font-family: 'Microsoft YaHei', 'SimSun', 'Noto Sans CJK SC', sans-serif; }
.japanese { font-family: 'Yu Gothic', 'Meiryo', 'Noto Sans CJK JP', sans-serif; }
.korean { font-family: 'Malgun Gothic', 'Noto Sans CJK KR', sans-serif; }
```

Apply classes to elements containing CJK text. The fonts support the full character ranges needed.

I've generated multilingual invoices where customer names, product descriptions, and addresses all use different scripts. Font fallbacks handle mixing English headings with Chinese product names and Arabic addresses in single documents without font configuration code.

What About Right-to-Left Languages?

Arabic and Hebrew write right-to-left (RTL), requiring special layout handling. Characters flow rightward, and UI elements mirror. HTML supports RTL through the dir attribute and CSS direction property.

Create RTL content:

```html
<p dir="rtl" style="direction: rtl; text-align: right;">
    مرحبا بكم في موقعنا
</p>
```

The dir="rtl" attribute tells the browser to render text right-to-left. The CSS confirms direction and aligns text right. IronPDF's Chromium engine respects these directives, rendering RTL text correctly in PDFs.

For documents mixing LTR and RTL content:

```html
<div>
    <p>English text flows left-to-right.</p>
    <p dir="rtl" style="direction: rtl;">النص العربي يتدفق من اليمين إلى اليسار.</p>
    <p>Back to English left-to-right.</p>
</div>
```

Each paragraph specifies its direction independently. The PDF renders each correctly.

I've generated contracts for Middle Eastern clients where English legal terms alternate with Arabic clauses. Proper RTL handling ensures the Arabic sections render naturally without manual text reversal or layout hacks.

How Do I Test International Character Rendering?

Testing ensures characters render before deploying to production. The simplest approach: generate test PDFs and open them in Adobe Reader or browser PDF viewers.

Create a test document with sample text from target languages:

```csharp
var testHtml = @"
<!DOCTYPE html>
<html>
<head>
    <meta charset=""UTF-8"">
    <style>
        body { font-family: Arial, 'Noto Sans', sans-serif; margin: 40px; }
        table { width: 100%; border-collapse: collapse; }
        td { padding: 10px; border: 1px solid #ddd; }
    </style>
</head>
<body>
    <h1>International Character Test</h1>
    <table>
        <tr><td>Chinese (Simplified)</td><td>简体字：你好世界 (Nǐ hǎo shìjiè)</td></tr>
        <tr><td>Chinese (Traditional)</td><td>繁體字：你好世界 (Nǐ hǎo shìjiè)</td></tr>
        <tr><td>Japanese</td><td>こんにちは世界 (Kon'nichiwa sekai)</td></tr>
        <tr><td>Korean</td><td>안녕하세요 세계 (Annyeonghaseyo segye)</td></tr>
        <tr><td>Arabic</td><td dir=""rtl"">مرحبا بالعالم</td></tr>
        <tr><td>Hebrew</td><td dir=""rtl"">שלום עולם</td></tr>
        <tr><td>Russian</td><td>Привет мир (Privet mir)</td></tr>
        <tr><td>Greek</td><td>Γειά σου κόσμε (Geiá sou kósme)</td></tr>
        <tr><td>Thai</td><td>สวัสดีชาวโลก (S̄wạs̄dī chāw lok)</td></tr>
        <tr><td>Hindi</td><td>नमस्ते दुनिया (Namaste duniya)</td></tr>
        <tr><td>Emoji</td><td>👋 🌍 🎉 ✨ 💼</td></tr>
    </table>
</body>
</html>
";

var pdf = renderer.RenderHtmlAsPdf(testHtml);
pdf.SaveAs("character-test.pdf");
```

Open the PDF. If characters appear as boxes or question marks, fonts are missing or encoding failed. If they render correctly, UTF-8 handling works.

For automated testing, extract text from PDFs and verify character preservation:

```csharp
var generatedPdf = renderer.RenderHtmlAsPdf(html);
var extractedText = generatedPdf.ExtractAllText();

if (extractedText.Contains("你好"))
{
    Console.WriteLine("✓ Chinese characters preserved");
}
else
{
    Console.WriteLine("✗ Chinese characters lost");
}
```

This confirms characters survived the HTML-to-PDF-to-text round-trip without corruption.
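The round-trip check can be generalized. This sketch (a hypothetical helper doing pure string inspection, no IronPDF dependency) flags the two most common corruption signatures in extracted text:

```csharp
using System;

// Hypothetical helper: flag likely encoding corruption in text extracted
// from a PDF. A U+FFFD replacement character, or a run of three or more
// '?' marks, usually means glyphs were lost during rendering.
static bool LooksCorrupted(string extracted) =>
    extracted.Contains('\uFFFD') ||
    extracted.Contains("???");

Console.WriteLine(LooksCorrupted("Customer: ????"));     // True
Console.WriteLine(LooksCorrupted("Customer: 山田太郎")); // False
```

A heuristic like this can run in CI against every generated template, catching regressions before customers see them.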

What About Server Font Requirements?

PDF generation happens server-side, meaning fonts must exist on the server. If your HTML references fonts not installed on the server, characters may render with fallback fonts or not at all.

For Windows servers, Microsoft YaHei (Chinese) and Malgun Gothic (Korean) ship with the OS; note that Arial Unicode MS comes with Microsoft Office, not Windows itself. For Linux servers, install font packages:

```shell
# Ubuntu/Debian
sudo apt-get install fonts-noto fonts-noto-cjk fonts-noto-color-emoji

# CentOS/RHEL
sudo yum install google-noto-sans-fonts google-noto-cjk-fonts
```

Noto Sans provides comprehensive Unicode coverage. Installing it ensures fallback fonts work correctly.

Alternatively, embed fonts using Base64 in HTML:

```html
<style>
    @font-face {
        font-family: 'CustomFont';
        src: url(data:font/woff2;base64,<BASE64_ENCODED_FONT>);
    }
    body { font-family: 'CustomFont', sans-serif; }
</style>
```

This embeds fonts directly in HTML, eliminating server font dependencies. Useful for containerized deployments where font installation is inconvenient.
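Generating that data URI at build time is straightforward. This sketch is a hypothetical helper using only the .NET base class library; the font path and family name are placeholders:

```csharp
using System;
using System.IO;

// Hypothetical helper: read a WOFF2 font file and emit a CSS @font-face
// rule with the font embedded as a Base64 data URI, so the rendering
// server needs no installed fonts.
static string BuildEmbeddedFontCss(string fontPath, string familyName)
{
    string base64 = Convert.ToBase64String(File.ReadAllBytes(fontPath));
    return "@font-face { font-family: '" + familyName + "'; " +
           "src: url(data:font/woff2;base64," + base64 + ") format('woff2'); }";
}

// Example usage (placeholder path and family name):
// string css = BuildEmbeddedFontCss("fonts/NotoSansCJK.woff2", "Noto Sans CJK SC");
```

Running this once at deployment and caching the resulting CSS avoids re-encoding a multi-megabyte CJK font on every request.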

I've deployed PDF generation services to Docker containers with minimal font packages. Embedding critical fonts via Base64 ensured consistent rendering regardless of container base image font availability.

Quick Reference

| Language | Script | Font Recommendation | Additional CSS |
| --- | --- | --- | --- |
| Chinese | CJK | Microsoft YaHei, Noto Sans CJK SC | None |
| Japanese | CJK | Yu Gothic, Noto Sans CJK JP | None |
| Korean | CJK | Malgun Gothic, Noto Sans CJK KR | None |
| Arabic | RTL | Arial, Noto Sans Arabic | dir="rtl" style="direction:rtl" |
| Hebrew | RTL | Arial, Noto Sans Hebrew | dir="rtl" style="direction:rtl" |
| Russian | Cyrillic | Arial, Noto Sans | None |
| Greek | Greek | Arial, Noto Sans | None |
| Thai | Thai | Arial, Noto Sans Thai | None |
| Hindi | Devanagari | Arial Unicode MS, Noto Sans Devanagari | None |
| Emoji | Symbols | Noto Color Emoji, Segoe UI Emoji | None |

Key Principles:

  • Include <meta charset="UTF-8"> in HTML head
  • Use font stacks with Unicode fallback fonts (Noto Sans)
  • For RTL languages (Arabic, Hebrew), add dir="rtl" and CSS direction:rtl
  • Test with sample characters from target languages before deployment
  • Server must have Unicode fonts installed (Noto Sans recommended)
  • Avoid Stack Overflow answers recommending manual BaseFont.IDENTITY_H setup (outdated, complex)
  • IronPDF's Chromium engine handles UTF-8 automatically with proper HTML charset

The complete UTF-8 PDF guide includes font embedding strategies and troubleshooting character encoding issues.


Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.
