IronSoftware

Posted on Mar 3

iText HTML to PDF Non-Latin Characters Not Working: Unicode Font (Issue Fixed)

#csharp #dotnet

Developers who convert HTML to PDF with iText or Flying Saucer (ITextRenderer) often find that non-Latin characters (Chinese, Japanese, Korean, Arabic, Hebrew, Cyrillic, and others) display as boxes or question marks, or disappear entirely. The primary Stack Overflow question on the topic has over 27,000 views, which suggests the problem is common. It stems from font embedding requirements that are not immediately obvious. This article documents the issue and examines alternatives with automatic Unicode support.

The Problem

When converting HTML containing non-Latin characters to PDF with iText or Flying Saucer:

Characters display as empty boxes (□□□□)
Characters appear as question marks (????)
Characters are simply missing from output
Some characters render while others don't

The issue affects:

Chinese (Simplified and Traditional)
Japanese (Hiragana, Katakana, Kanji)
Korean (Hangul)
Arabic and Hebrew
Cyrillic (Russian, Ukrainian, etc.)
Greek
Thai, Vietnamese, Hindi, and many others

Error Messages and Symptoms

There are typically no error messages - characters simply don't render:

<!-- Input HTML -->
<p>Hello 世界 مرحبا Привет שלום</p>

<!-- Output shows -->
Hello □□ □□□□ □□□□□□ □□□□

Who Is Affected

This issue impacts any application with international content:

Industries: Global e-commerce, international business, education, government, any organization serving non-English speakers.

Content Types: Invoices with international addresses, multilingual documents, user-generated content, translated content.

Scale: The 27,642 views on the primary Stack Overflow question indicate widespread impact.

Evidence from the Developer Community

Stack Overflow

Question	Views	Votes
Generation of PDF from HTML with non-Latin characters	27,642	18

Developer Reports

"I am from Czech Republic, and had same problem with our national symbols!"
— Developer, Stack Overflow, 2012

"Chinese characters are showing as boxes in the generated PDF."
— Developer, Stack Overflow, 2015

"Arabic text appears reversed and characters are disconnected."
— Developer, Stack Overflow, 2018

Root Cause Analysis

The issue stems from fundamental PDF font requirements that iText does not handle automatically.

How PDF Fonts Work

Unlike HTML documents where browsers have access to system fonts and web fonts, PDF files must embed font data:

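PDF Font Embedding: For reliable display, every font used in a PDF should be embedded in the file itself; the only fonts a viewer is expected to supply are the 14 standard PDF fonts
Standard PDF Fonts: The 14 standard fonts (the Helvetica, Times, and Courier families plus Symbol and ZapfDingbats) cover Latin text, basic punctuation, and a fixed set of symbols, with no CJK, Arabic, Hebrew, or Cyrillic glyphs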
Character Coverage: Each font file only contains specific characters - no single font covers all Unicode characters
Font Subsetting: PDF generators typically embed only the characters used, not entire font files
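To make the subsetting idea concrete, here is a minimal sketch that collects the distinct code points a piece of text actually uses, which is roughly the set of glyphs a subsetting PDF generator would need to embed. The class and method names are illustrative only and are not part of iText or Flying Saucer.

```java
import java.util.TreeSet;
import java.util.stream.Collectors;

final class GlyphSubsetSketch {
    /** Returns the distinct Unicode code points used in the text, in ascending order. */
    static TreeSet<Integer> usedCodePoints(String text) {
        return text.codePoints()
                   .boxed()
                   .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // "Hello" followed by a space and two Han characters (U+4E16, U+754C)
        String text = "Hello \u4e16\u754c";
        for (int codePoint : usedCodePoints(text)) {
            System.out.printf("U+%04X%n", codePoint);
        }
    }
}
```

A real subsetter works on glyph IDs rather than code points, so treat this as an approximation of what ends up in the file.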

Why iText Shows Boxes or Question Marks

When iText processes HTML containing non-Latin characters:

It parses the HTML and encounters Chinese/Arabic/Cyrillic characters
It looks for a font that contains those characters
If no suitable font is configured, it falls back to a standard PDF font
The standard font doesn't have those characters
The PDF shows replacement characters: boxes (missing glyph) or question marks
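The steps above can be modelled in a few lines. This is a simplified simulation, not iText's actual lookup: the coverage table mapping font names to Unicode scripts is hypothetical, and characters in the COMMON script (spaces, digits, punctuation) are assumed to be available in every font.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class FontFallbackSketch {
    /**
     * Replaces every character that no configured font covers with U+25A1 (white square).
     * The fonts map is a hypothetical coverage table: font name -> scripts it has glyphs for.
     */
    static String render(String text, Map<String, Set<UnicodeScript>> fonts) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: spaces, digits, and punctuation (COMMON) exist in every font.
            boolean covered = script == UnicodeScript.COMMON
                    || fonts.values().stream().anyMatch(s -> s.contains(script));
            out.appendCodePoint(covered ? cp : 0x25A1);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Set<UnicodeScript>> fonts = new LinkedHashMap<>();
        fonts.put("Helvetica", EnumSet.of(UnicodeScript.LATIN));
        String text = "Hello \u4e16\u754c"; // "Hello" plus two Han characters

        System.out.println(render(text, fonts)); // Han characters become boxes

        fonts.put("NotoSansCJK", EnumSet.of(UnicodeScript.HAN));
        System.out.println(render(text, fonts)); // Han characters now survive
    }
}
```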

Character Encoding Is Not the Problem

Many developers mistakenly think this is an encoding issue:

// This does NOT fix the problem
String html = new String(htmlBytes, "UTF-8");

The HTML encoding is typically correct. The problem is that the PDF generator doesn't have a font file containing the required characters.
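One way to rule encoding out before looking at fonts is to confirm that the text survives a UTF-8 round trip with its code points intact. The sketch below uses only the standard library; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

final class EncodingCheckSketch {
    /** Decodes raw bytes as UTF-8, the same step the snippet above performs. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Returns true if encoding to UTF-8 and decoding again yields the same string. */
    static boolean survivesRoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return decodeUtf8(bytes).equals(text);
    }

    public static void main(String[] args) {
        // "Hello", two Han characters, and the Cyrillic word for "hello"
        String html = "<p>Hello \u4e16\u754c \u041F\u0440\u0438\u0432\u0435\u0442</p>";
        System.out.println("Round trip intact: " + survivesRoundTrip(html));
        // Print the non-ASCII code points to show they are present in memory.
        html.codePoints()
            .filter(cp -> cp > 0x7F)
            .forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

If the round trip succeeds and the expected code points are listed, the string is fine and the missing glyphs point to font configuration instead.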

Right-to-Left Language Complexity

Arabic and Hebrew present additional challenges:

Text Direction: Characters must render right-to-left
Character Shaping: Arabic letters change form based on position (initial, medial, final, isolated)
Ligatures: Certain character combinations should form connected shapes
Mixed Direction: Numbers within RTL text require bidirectional handling

iText requires specific configuration to handle these correctly, beyond just font embedding.
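To illustrate the bidirectional part of the problem, the sketch below uses the JDK's java.text.Bidi class to split a string into directional runs. It only demonstrates direction analysis; it performs no Arabic shaping or ligature substitution, and it is not how iText processes text internally. The class and method names are illustrative.

```java
import java.text.Bidi;

final class BidiSketch {
    /** Returns true if the text contains both left-to-right and right-to-left runs. */
    static boolean isMixedDirection(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    /** Lists each directional run on its own line, prefixed with LTR or RTL. */
    static String describeRuns(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = text.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Odd embedding levels are right-to-left, even levels are left-to-right.
            String direction = (bidi.getRunLevel(i) % 2 == 1) ? "RTL" : "LTR";
            sb.append(direction).append(": ").append(run).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Latin text followed by the Hebrew word "shalom"
        String text = "abc \u05E9\u05DC\u05D5\u05DD";
        System.out.println("Mixed direction: " + isMixedDirection(text));
        System.out.print(describeRuns(text));
    }
}
```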

Attempted Workarounds

Workaround 1: Manual Font Registration

Approach: Register fonts that contain the required characters.

// Flying Saucer approach
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont(
    "/path/to/NotoSansCJK.ttf",
    BaseFont.IDENTITY_H,
    BaseFont.EMBEDDED
);

// iText 7 approach (.NET casing shown; the Java API names this method createFont)
PdfFont font = PdfFontFactory.CreateFont(
    "path/to/font.ttf",
    PdfEncodings.IDENTITY_H,
    PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED
);

Limitations:

Must know which fonts to include for each language
Fonts must be available on the server
Different fonts needed for different character sets
Complex font fallback chains required for mixed content
Licensing issues with some fonts

Workaround 2: Use Google Noto Fonts

Approach: Use Google's Noto font family which covers most Unicode characters.

// Register multiple Noto fonts for different scripts
ITextRenderer renderer = new ITextRenderer();
ITextFontResolver fontResolver = renderer.getFontResolver();

// CJK (Chinese, Japanese, Korean)
fontResolver.addFont("fonts/NotoSansCJKsc-Regular.otf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Arabic
fontResolver.addFont("fonts/NotoSansArabic-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Hebrew
fontResolver.addFont("fonts/NotoSansHebrew-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Cyrillic, Greek, and Vietnamese (extended Latin) share one file, so register it once
fontResolver.addFont("fonts/NotoSans-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Thai
fontResolver.addFont("fonts/NotoSansThai-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Devanagari (Hindi)
fontResolver.addFont("fonts/NotoSansDevanagari-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

Font Files Required for Full Coverage:

Script	Noto Font File	Size
Latin Extended	NotoSans-Regular.ttf	~500KB
CJK (all)	NotoSansCJK-Regular.ttc	~120MB
Arabic	NotoSansArabic-Regular.ttf	~200KB
Hebrew	NotoSansHebrew-Regular.ttf	~100KB
Thai	NotoSansThai-Regular.ttf	~150KB
Devanagari	NotoSansDevanagari-Regular.ttf	~200KB
Greek	NotoSans-Regular.ttf	(included)
Cyrillic	NotoSans-Regular.ttf	(included)

Limitations:

Noto font collection is ~1GB for full coverage
Must register correct font for each script in your content
Font fallback logic still required for mixed content
Significant setup and deployment complexity
Docker/container deployments need fonts bundled
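One way to reduce the guesswork about which files to register is to inspect the content first. The sketch below maps the Unicode scripts found in a string to the Noto file names from the table above. The mapping is an assumption made for illustration (for example, Hiragana, Katakana, and Hangul are all pointed at the CJK file), and the class is not part of any PDF library.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class NotoFontPlanner {
    // Assumed mapping from script to the Noto file names listed in the table above.
    private static final Map<UnicodeScript, String> FONT_FOR_SCRIPT =
            new EnumMap<>(UnicodeScript.class);

    static {
        FONT_FOR_SCRIPT.put(UnicodeScript.LATIN, "NotoSans-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.GREEK, "NotoSans-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.CYRILLIC, "NotoSans-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.HAN, "NotoSansCJK-Regular.ttc");
        FONT_FOR_SCRIPT.put(UnicodeScript.HIRAGANA, "NotoSansCJK-Regular.ttc");
        FONT_FOR_SCRIPT.put(UnicodeScript.KATAKANA, "NotoSansCJK-Regular.ttc");
        FONT_FOR_SCRIPT.put(UnicodeScript.HANGUL, "NotoSansCJK-Regular.ttc");
        FONT_FOR_SCRIPT.put(UnicodeScript.ARABIC, "NotoSansArabic-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.HEBREW, "NotoSansHebrew-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.THAI, "NotoSansThai-Regular.ttf");
        FONT_FOR_SCRIPT.put(UnicodeScript.DEVANAGARI, "NotoSansDevanagari-Regular.ttf");
    }

    /** Returns the sorted set of font files needed for the scripts present in the text. */
    static Set<String> fontsNeeded(String text) {
        Set<String> needed = new TreeSet<>();
        text.codePoints().forEach(cp -> {
            String fontFile = FONT_FOR_SCRIPT.get(UnicodeScript.of(cp));
            if (fontFile != null) {
                needed.add(fontFile); // scripts without a mapping are skipped
            }
        });
        return needed;
    }

    public static void main(String[] args) {
        // Latin, Han, and Cyrillic text in one string
        String text = "Hello \u4e16\u754c \u041F\u0440\u0438\u0432\u0435\u0442";
        fontsNeeded(text).forEach(System.out::println);
    }
}
```

Scripts missing from the map are silently ignored here; a production version would probably log them so that gaps in coverage are visible.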

Workaround 3: Docker and Linux Font Configuration

When running in Docker or Linux environments:

# Dockerfile - Install fonts for PDF generation
FROM openjdk:11-jre-slim

# Install font packages
RUN apt-get update && apt-get install -y \
    fonts-noto-cjk \
    fonts-noto-core \
    fonts-noto-extra \
    fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Copy application
COPY target/app.jar /app/app.jar

WORKDIR /app
CMD ["java", "-jar", "app.jar"]

Linux Font Paths:

// Linux font locations to check
String[] fontPaths = {
    "/usr/share/fonts/opentype/noto/",
    "/usr/share/fonts/truetype/noto/",
    "/usr/share/fonts/google-noto/",
    "/usr/local/share/fonts/"
};

for (String path : fontPaths) {
    File dir = new File(path);
    if (dir.exists()) {
        // Register fonts from this directory
    }
}

Limitations:

Adds significant container size (Noto CJK fonts are 100MB+)
Font paths vary between Linux distributions
Must configure iText to find installed fonts
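The directory loop above leaves the registration step as a placeholder. As a starting point, the sketch below collects the font files found under a list of candidate directories using only java.nio; each returned path could then be passed to whatever registration call the PDF library provides. The class name and the choice of extensions (.ttf, .otf, .ttc) are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class FontDirectoryScanner {
    /** Recursively collects .ttf, .otf, and .ttc files from the directories that exist. */
    static List<Path> findFontFiles(List<Path> directories) {
        List<Path> found = new ArrayList<>();
        for (Path dir : directories) {
            if (!Files.isDirectory(dir)) {
                continue; // font paths vary between distributions, so skip missing ones
            }
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile)
                     .filter(FontDirectoryScanner::isFontFile)
                     .sorted()
                     .forEach(found::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return found;
    }

    private static boolean isFontFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf") || name.endsWith(".ttc");
    }

    public static void main(String[] args) {
        List<Path> candidates = Arrays.asList(
                Paths.get("/usr/share/fonts/opentype/noto/"),
                Paths.get("/usr/share/fonts/truetype/noto/"),
                Paths.get("/usr/share/fonts/google-noto/"),
                Paths.get("/usr/local/share/fonts/"));
        findFontFiles(candidates).forEach(System.out::println);
    }
}
```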

Workaround 4: Convert Characters to Images

Approach: Render non-Latin text as images before PDF generation.

Limitations:

Text is not searchable or selectable
Significantly larger file sizes
Quality issues with scaling
Complex implementation

Troubleshooting: "My Characters Still Show as Boxes"

If you've registered fonts but characters still don't render:

1. Verify Font Contains Required Characters

import java.awt.Font;
import java.io.File;

// Check if font file contains specific characters
// (createFont throws FontFormatException or IOException if the file cannot be read)
Font font = Font.createFont(Font.TRUETYPE_FONT, new File("NotoSansCJK.ttf"));
boolean hasCharacter = font.canDisplay('\u4e2d'); // Chinese character
System.out.println("Font supports character: " + hasCharacter);
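Checking one character at a time gets tedious for real documents. The sketch below generalises the check to a whole string and reports every code point the font cannot display. The coverage test is passed in as an IntPredicate so the helper stays independent of any font library; with java.awt.Font it could be supplied as font::canDisplay. The class and method names are illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.IntPredicate;

final class GlyphCoverageCheck {
    /**
     * Returns the code points (formatted as U+XXXX) that the supplied predicate
     * reports as not displayable. Whitespace is ignored.
     */
    static Set<String> missingCodePoints(String text, IntPredicate canDisplay) {
        Set<String> missing = new LinkedHashSet<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .filter(cp -> !canDisplay.test(cp))
            .forEach(cp -> missing.add(String.format("U+%04X", cp)));
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for a Latin-only font: only ASCII is considered displayable.
        IntPredicate asciiOnly = cp -> cp < 0x80;
        String text = "Invoice \u4e2d\u6587"; // "Invoice" plus two Han characters
        System.out.println("Missing: " + missingCodePoints(text, asciiOnly));
    }
}
```

An empty result suggests the font covers the text; any entries tell you which script still needs a font.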

2. Check Font Registration Order

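With Flying Saucer, fonts are resolved through the CSS font-family property, so a registered font is only used when the stylesheet names its family. Where several registered fonts could match, register script-specific fonts before generic ones: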

// Register CJK font first for Asian character priority
resolver.addFont("NotoSansCJK.ttf", ...);
// Then register Latin font
resolver.addFont("NotoSans.ttf", ...);

3. Verify Encoding Settings

// Must use IDENTITY_H for Unicode
PdfFontFactory.createFont(fontPath, PdfEncodings.IDENTITY_H,
    PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);

4. Check HTML Character Encoding

<!-- Ensure HTML declares UTF-8 -->
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
</head>

A Different Approach: IronPDF

IronPDF renders HTML with Chromium, whose font system automatically falls back to installed system fonts when the primary font lacks a character.

Why IronPDF Handles Unicode Automatically

Chromium's text rendering:

Automatically detects required character sets
Uses system font fallback chains
Supports CSS @font-face for web fonts
Handles mixed-language content seamlessly

No manual font registration is required.

Code Example

using IronPdf;

public class MultilingualPdfGenerator
{
    public byte[] GenerateMultilingualDocument()
    {
        var renderer = new ChromePdfRenderer();

        // All scripts render automatically - no font configuration needed
        string html = @"
<!DOCTYPE html>
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body {
            font-family: 'Segoe UI', 'Noto Sans', 'Arial Unicode MS', sans-serif;
            font-size: 14px;
            line-height: 1.8;
        }
        h2 { color: #333; margin-top: 30px; }
        .sample {
            padding: 15px;
            background: #f5f5f5;
            border-radius: 8px;
            margin: 10px 0;
        }
    </style>
</head>
<body>
    <h1>Multilingual Document</h1>

    <h2>Chinese (简体中文)</h2>
    <div class='sample'>
        这是一个测试文档。欢迎使用我们的服务。
    </div>

    <h2>Japanese (日本語)</h2>
    <div class='sample'>
        これはテスト文書です。サービスへようこそ。
    </div>

    <h2>Korean (한국어)</h2>
    <div class='sample'>
        이것은 테스트 문서입니다. 서비스를 이용해 주셔서 감사합니다.
    </div>

    <h2>Arabic (العربية)</h2>
    <div class='sample' dir='rtl'>
        هذا مستند اختبار. مرحبا بكم في خدمتنا.
    </div>

    <h2>Hebrew (עברית)</h2>
    <div class='sample' dir='rtl'>
        זהו מסמך בדיקה. ברוכים הבאים לשירות שלנו.
    </div>

    <h2>Russian (Русский)</h2>
    <div class='sample'>
        Это тестовый документ. Добро пожаловать в наш сервис.
    </div>

    <h2>Thai (ภาษาไทย)</h2>
    <div class='sample'>
        นี่คือเอกสารทดสอบ ยินดีต้อนรับสู่บริการของเรา
    </div>

    <h2>Hindi (हिन्दी)</h2>
    <div class='sample'>
        यह एक परीक्षण दस्तावेज़ है। हमारी सेवा में आपका स्वागत है।
    </div>

    <h2>Mixed Content</h2>
    <div class='sample'>
        Welcome 欢迎 ようこそ 환영합니다 مرحبا Добро пожаловать
    </div>
</body>
</html>";

        using var pdf = renderer.RenderHtmlAsPdf(html);
        return pdf.BinaryData;
    }

    public byte[] GenerateInternationalInvoice(InvoiceData invoice)
    {
        var renderer = new ChromePdfRenderer();

        string html = $@"
<!DOCTYPE html>
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body {{ font-family: system-ui, sans-serif; }}
        .address {{ white-space: pre-line; }}
    </style>
</head>
<body>
    <h1>Invoice / 发票 / 請求書</h1>

    <div class='address'>
        {System.Net.WebUtility.HtmlEncode(invoice.CustomerName)}
        {System.Net.WebUtility.HtmlEncode(invoice.Address)}
    </div>

    <!-- Address with Chinese/Japanese characters works automatically -->
</body>
</html>";

        using var pdf = renderer.RenderHtmlAsPdf(html);
        return pdf.BinaryData;
    }
}

Key points:

UTF-8 HTML with any script works automatically
No font registration required
Right-to-left languages (Arabic, Hebrew) are handled correctly
Mixed-language content renders properly

Additional Language Examples

public byte[] GenerateVietnameseDocument()
{
    var renderer = new ChromePdfRenderer();

    // Vietnamese uses Latin script with extensive diacritics
    string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
    <h1>Tiếng Việt</h1>
    <p>Xin chào! Chào mừng bạn đến với dịch vụ của chúng tôi.</p>
    <p>Đây là một tài liệu thử nghiệm bằng tiếng Việt.</p>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

public byte[] GenerateGreekDocument()
{
    var renderer = new ChromePdfRenderer();

    string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
    <h1>Ελληνικά</h1>
    <p>Γεια σας! Καλώς ήρθατε στην υπηρεσία μας.</p>
    <p>Αυτό είναι ένα δοκιμαστικό έγγραφο στα ελληνικά.</p>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

public byte[] GenerateComplexMixedDocument()
{
    var renderer = new ChromePdfRenderer();

    // Document with multiple scripts and directions
    string html = @"
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body { font-family: system-ui, sans-serif; }
        .rtl { direction: rtl; text-align: right; }
        table { width: 100%; border-collapse: collapse; }
        td { padding: 10px; border: 1px solid #ccc; }
    </style>
</head>
<body>
    <h1>International Customer List</h1>
    <table>
        <tr>
            <td>Japan</td>
            <td>田中太郎</td>
            <td>東京都港区</td>
        </tr>
        <tr>
            <td>China</td>
            <td>李明</td>
            <td>北京市朝阳区</td>
        </tr>
        <tr>
            <td>UAE</td>
            <td class='rtl'>محمد أحمد</td>
            <td class='rtl'>دبي، الإمارات</td>
        </tr>
        <tr>
            <td>Israel</td>
            <td class='rtl'>יוסי כהן</td>
            <td class='rtl'>תל אביב</td>
        </tr>
        <tr>
            <td>Russia</td>
            <td>Иван Петров</td>
            <td>Москва</td>
        </tr>
        <tr>
            <td>Thailand</td>
            <td>สมชาย ใจดี</td>
            <td>กรุงเทพมหานคร</td>
        </tr>
    </table>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

Migration Considerations

Licensing

iText has AGPL or commercial licensing
IronPDF is commercial with perpetual licensing
IronPDF Licensing

What You Gain

Automatic Unicode support for all languages
No font registration or configuration
RTL language support built-in
Same results as browser rendering

What to Consider

Different library, different API
Chromium-based vs custom renderer
Commercial licensing

Conclusion

iText's non-Latin character issues require significant manual configuration that is error-prone and hard to maintain across all possible character sets. Browser-based rendering engines handle Unicode automatically through system font fallback, providing reliable international text support without manual font management.

Written by Jacob Mellor, who leads technical development at Iron Software.

References

Stack Overflow: Generation of PDF from HTML with non-Latin characters - 27K+ views

For the latest IronPDF documentation and tutorials, visit ironpdf.com.
