IronSoftware

iText HTML to PDF Non-Latin Characters Not Working: Unicode Font (Issue Fixed)

Developers using iText or Flying Saucer (ITextRenderer) to convert HTML to PDF find that non-Latin characters (Chinese, Japanese, Korean, Arabic, Hebrew, Cyrillic, etc.) display as boxes, question marks, or disappear entirely. With over 27,000 views on Stack Overflow, this is one of the most common iText internationalization issues. The problem stems from font embedding requirements that are not immediately obvious. This article documents the issue and examines alternatives with automatic Unicode support.

The Problem

When converting HTML containing non-Latin characters to PDF with iText or Flying Saucer:

  • Characters display as empty boxes (□□□□)
  • Characters appear as question marks (????)
  • Characters are simply missing from output
  • Some characters render while others don't

The issue affects:

  • Chinese (Simplified and Traditional)
  • Japanese (Hiragana, Katakana, Kanji)
  • Korean (Hangul)
  • Arabic and Hebrew
  • Cyrillic (Russian, Ukrainian, etc.)
  • Greek
  • Thai, Vietnamese, Hindi, and many others

Error Messages and Symptoms

There are typically no error messages - characters simply don't render:

<!-- Input HTML -->
<p>Hello 世界 مرحبا Привет שלום</p>

<!-- Output shows -->
Hello □□ □□□□ □□□□□□ □□□□

Who Is Affected

This issue impacts any application with international content:

Industries: Global e-commerce, international business, education, government, any organization serving non-English speakers.

Content Types: Invoices with international addresses, multilingual documents, user-generated content, translated content.

Scale: The primary Stack Overflow question has 27,642 views, which suggests the problem is widespread.

Evidence from the Developer Community

Stack Overflow

Question | Views | Votes
Generation of PDF from HTML with non-Latin characters | 27,642 | 18

Developer Reports

"I am from Czech Republic, and had same problem with our national symbols!"
— Developer, Stack Overflow, 2012

"Chinese characters are showing as boxes in the generated PDF."
— Developer, Stack Overflow, 2015

"Arabic text appears reversed and characters are disconnected."
— Developer, Stack Overflow, 2018

Root Cause Analysis

The issue stems from fundamental PDF font requirements that iText does not handle automatically.

How PDF Fonts Work

Unlike HTML documents where browsers have access to system fonts and web fonts, PDF files must embed font data:

  1. PDF Font Embedding: Every font used in a PDF must be embedded within the file itself (or reference one of 14 standard PDF fonts)
  2. Standard PDF Fonts: The 14 standard fonts (Helvetica, Times, Courier, and their variants, plus Symbol and ZapfDingbats) cover Latin text, basic punctuation, and symbols, but contain no CJK, Arabic, Hebrew, or Cyrillic glyphs
  3. Character Coverage: Each font file only contains specific characters - no single font covers all Unicode characters
  4. Font Subsetting: PDF generators typically embed only the characters used, not entire font files

Why iText Shows Boxes or Question Marks

When iText processes HTML containing non-Latin characters:

  1. It parses the HTML and encounters Chinese/Arabic/Cyrillic characters
  2. It looks for a font that contains those characters
  3. If no suitable font is configured, it falls back to a standard PDF font
  4. The standard font doesn't have those characters
  5. The PDF shows replacement characters: boxes (missing glyph) or question marks
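The steps above can be mimicked with a toy model. The sketch below does not call iText at all; the class name, the two coverage predicates, and the box character are illustrative assumptions chosen only to show why a Latin-only fallback font ends in missing-glyph boxes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntPredicate;

/** Toy model of glyph lookup. It only mimics the fallback steps; it is not iText code. */
final class GlyphFallbackDemo {

    /** Stand-in for the "missing glyph" box (U+25A1 WHITE SQUARE). */
    static final int MISSING_GLYPH = 0x25A1;

    /** Hypothetical coverage of a Latin-only font: Basic Latin and Latin-1. */
    static final IntPredicate LATIN_ONLY = cp -> cp <= 0x00FF;

    /** Hypothetical coverage of a CJK font: Basic Latin plus CJK Unified Ideographs. */
    static final IntPredicate CJK = cp -> cp <= 0x007F || (cp >= 0x4E00 && cp <= 0x9FFF);

    /** Keeps every code point that some configured font covers; replaces the rest with a box. */
    static String render(String text, Map<String, IntPredicate> fonts) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean covered = fonts.values().stream().anyMatch(font -> font.test(cp));
            out.appendCodePoint(covered ? cp : MISSING_GLYPH);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Hello \u4e16\u754c"; // "Hello" followed by two Chinese characters

        Map<String, IntPredicate> fonts = new LinkedHashMap<>();
        fonts.put("Latin-only font", LATIN_ONLY);
        System.out.println(render(text, fonts)); // the Chinese characters become boxes

        fonts.put("CJK font", CJK);
        System.out.println(render(text, fonts)); // all characters are now covered
    }
}
```

The input string is identical in both calls; only the set of available fonts changes, which is the point of the root-cause analysis.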

Character Encoding Is Not the Problem

Many developers mistakenly think this is an encoding issue:

// This does NOT fix the problem
String html = new String(htmlBytes, "UTF-8");

The HTML encoding is typically correct. The problem is that the PDF generator doesn't have a font file containing the required characters.
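One way to rule encoding out before looking at fonts is a quick round-trip check. The following is a minimal sketch using only the Java standard library; the class and method names are made up for this illustration. It relies on the fact that `new String(bytes, StandardCharsets.UTF_8)` substitutes U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

/** Quick check that bytes really are valid UTF-8 before blaming the encoding. */
final class EncodingCheck {

    /**
     * Decodes the bytes as UTF-8 and reports whether a U+FFFD replacement character appeared.
     * Caveat: text that legitimately contains U+FFFD would also return false.
     */
    static boolean decodesCleanly(byte[] utf8Bytes) {
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        return decoded.indexOf('\uFFFD') < 0;
    }

    public static void main(String[] args) {
        String original = "Hello \u4e16\u754c"; // "Hello" followed by two Chinese characters
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(bytes, StandardCharsets.UTF_8);

        System.out.println("Round trip preserved text: " + original.equals(decoded));
        System.out.println("Decoded cleanly: " + decodesCleanly(bytes));
    }
}
```

If the check passes and the PDF still shows boxes, the string reaching the renderer is intact and the missing piece is the font.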

Right-to-Left Language Complexity

Arabic and Hebrew present additional challenges:

  1. Text Direction: Characters must render right-to-left
  2. Character Shaping: Arabic letters change form based on position (initial, medial, final, isolated)
  3. Ligatures: Certain character combinations should form connected shapes
  4. Mixed Direction: Numbers within RTL text require bidirectional handling

iText requires specific configuration to handle these correctly, beyond just font embedding.
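To find out whether a given piece of content needs this extra handling at all, the JDK's `java.text.Bidi` class can be used as a pre-check. This is a sketch under the assumption that you only want to detect direction; it does not perform Arabic shaping or ligature substitution, and the class and method names are illustrative.

```java
import java.text.Bidi;

/** Detects whether text needs bidirectional layout. Direction only, no shaping. */
final class BidiCheck {

    /** True if the text contains right-to-left characters. */
    static boolean needsBidi(String text) {
        char[] chars = text.toCharArray();
        return Bidi.requiresBidi(chars, 0, chars.length);
    }

    /** True if the text mixes left-to-right and right-to-left runs. */
    static boolean isMixedDirection(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isMixed();
    }

    public static void main(String[] args) {
        String latin = "Invoice 123";
        String hebrew = "\u05e9\u05dc\u05d5\u05dd";       // Hebrew word
        String mixed = "Invoice 123 " + hebrew;           // Latin text followed by Hebrew

        System.out.println("Latin needs bidi:  " + needsBidi(latin));
        System.out.println("Hebrew needs bidi: " + needsBidi(hebrew));
        System.out.println("Mixed direction:   " + isMixedDirection(mixed));
    }
}
```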

Attempted Workarounds

Workaround 1: Manual Font Registration

Approach: Register fonts that contain the required characters.

// Flying Saucer approach
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont(
    "/path/to/NotoSansCJK.ttf",
    BaseFont.IDENTITY_H,
    BaseFont.EMBEDDED
);

// iText 7 approach (Java method names start with a lowercase letter)
PdfFont font = PdfFontFactory.createFont(
    "path/to/font.ttf",
    PdfEncodings.IDENTITY_H,
    PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED
);

Limitations:

  • Must know which fonts to include for each language
  • Fonts must be available on the server
  • Different fonts needed for different character sets
  • Complex font fallback chains required for mixed content
  • Licensing issues with some fonts
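The first limitation, knowing which fonts a document needs, can be partly automated by inspecting the content. A minimal sketch, assuming the JDK's `Character.UnicodeScript` classification is precise enough for your content; the class name is hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

/** Lists the Unicode scripts present in a string, as a hint for which fonts to register. */
final class ScriptDetector {

    static Set<UnicodeScript> scriptsIn(String text) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints()
            .mapToObj(UnicodeScript::of)
            // Spaces, digits, and punctuation are COMMON; combining marks are INHERITED.
            .filter(s -> s != UnicodeScript.COMMON
                      && s != UnicodeScript.INHERITED
                      && s != UnicodeScript.UNKNOWN)
            .forEach(scripts::add);
        return scripts;
    }

    public static void main(String[] args) {
        // Latin, Chinese, Russian, and Hebrew words separated by spaces
        String html = "Hello \u4e16\u754c \u041f\u0440\u0438\u0432\u0435\u0442 \u05e9\u05dc\u05d5\u05dd";
        System.out.println(scriptsIn(html));
    }
}
```

Running this over representative templates and user data gives a concrete list of scripts to cover, rather than guessing per language.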

Workaround 2: Use Google Noto Fonts

Approach: Use Google's Noto font family which covers most Unicode characters.

// Register multiple Noto fonts for different scripts
ITextRenderer renderer = new ITextRenderer();
ITextFontResolver fontResolver = renderer.getFontResolver();

// CJK (Chinese, Japanese, Korean)
fontResolver.addFont("fonts/NotoSansCJKsc-Regular.otf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Arabic
fontResolver.addFont("fonts/NotoSansArabic-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Hebrew
fontResolver.addFont("fonts/NotoSansHebrew-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Cyrillic, Greek, and Vietnamese (extended Latin) are all covered by NotoSans-Regular,
// so it only needs to be registered once
fontResolver.addFont("fonts/NotoSans-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Thai
fontResolver.addFont("fonts/NotoSansThai-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

// Devanagari (Hindi)
fontResolver.addFont("fonts/NotoSansDevanagari-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);

Font Files Required for Full Coverage:

Script | Noto Font | File Size
Latin Extended | NotoSans-Regular.ttf | ~500KB
CJK (all) | NotoSansCJK-Regular.ttc | ~120MB
Arabic | NotoSansArabic-Regular.ttf | ~200KB
Hebrew | NotoSansHebrew-Regular.ttf | ~100KB
Thai | NotoSansThai-Regular.ttf | ~150KB
Devanagari | NotoSansDevanagari-Regular.ttf | ~200KB
Greek | NotoSans-Regular.ttf | (included)
Cyrillic | NotoSans-Regular.ttf | (included)

Limitations:

  • Noto font collection is ~1GB for full coverage
  • Must register correct font for each script in your content
  • Font fallback logic still required for mixed content
  • Significant setup and deployment complexity
  • Docker/container deployments need fonts bundled
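The fallback logic mentioned above usually amounts to splitting text into runs by script and choosing a font per run. The sketch below shows one possible shape for that logic using only the JDK. The class, the `Run` type, and the script-to-file mapping are assumptions for illustration (the file names are the Noto files listed earlier), and nothing here calls iText or Flying Saucer.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Splits mixed-script text into runs and picks a font file per run (illustrative only). */
final class ScriptRuns {

    /** A stretch of consecutive text that can be drawn with one font. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** Spaces, digits, and punctuation are script-neutral and stay with the current run. */
    static boolean isNeutral(UnicodeScript script) {
        return script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
    }

    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            boolean neutral = isNeutral(script);
            // A new non-neutral script closes the run collected so far.
            if (!neutral && currentScript != null && script != currentScript) {
                runs.add(new Run(currentScript, current.toString()));
                current.setLength(0);
            }
            if (!neutral) {
                currentScript = script;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(new Run(currentScript == null ? UnicodeScript.COMMON : currentScript,
                             current.toString()));
        }
        return runs;
    }

    /** Assumed mapping from script to the Noto file names used in this article. */
    static String fontFor(UnicodeScript script) {
        switch (script) {
            case HAN:
            case HIRAGANA:
            case KATAKANA:
            case HANGUL:
                return "NotoSansCJKsc-Regular.otf";
            case ARABIC:
                return "NotoSansArabic-Regular.ttf";
            case HEBREW:
                return "NotoSansHebrew-Regular.ttf";
            case THAI:
                return "NotoSansThai-Regular.ttf";
            case DEVANAGARI:
                return "NotoSansDevanagari-Regular.ttf";
            default:
                return "NotoSans-Regular.ttf"; // Latin, Greek, Cyrillic
        }
    }

    public static void main(String[] args) {
        // "Invoice", the Chinese word for invoice, and the Russian word for invoice
        for (Run run : split("Invoice \u53d1\u7968 \u0421\u0447\u0451\u0442")) {
            System.out.println(run.script + " -> " + fontFor(run.script) + " : [" + run.text + "]");
        }
    }
}
```

A production implementation would also have to deal with bidirectional reordering and shaping, which this sketch deliberately leaves out.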

Workaround 3: Docker and Linux Font Configuration

When running in Docker or Linux environments:

# Dockerfile - Install fonts for PDF generation
FROM openjdk:11-jre-slim

# Install font packages
RUN apt-get update && apt-get install -y \
    fonts-noto-cjk \
    fonts-noto-core \
    fonts-noto-extra \
    fonts-freefont-ttf \
    && rm -rf /var/lib/apt/lists/*

# Copy application
COPY target/app.jar /app/app.jar

WORKDIR /app
CMD ["java", "-jar", "app.jar"]

Linux Font Paths:

// Linux font locations to check
String[] fontPaths = {
    "/usr/share/fonts/opentype/noto/",
    "/usr/share/fonts/truetype/noto/",
    "/usr/share/fonts/google-noto/",
    "/usr/local/share/fonts/"
};

for (String path : fontPaths) {
    File dir = new File(path);
    if (dir.exists()) {
        // Register fonts from this directory
    }
}

Limitations:

  • Adds significant container size (Noto CJK fonts are 100MB+)
  • Font paths vary between Linux distributions
  • Must configure iText to find installed fonts
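Because font paths differ between distributions, it can help to scan several candidate directories at startup and log what was actually found. The following sketch uses only `java.nio.file`; the class name is hypothetical, the directory list is the one from the snippet above, and registering each file with the font resolver is left to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Collects font files from whichever candidate directories exist on this machine. */
final class FontFileFinder {

    static boolean isFontFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf") || name.endsWith(".ttc");
    }

    /** Walks each existing root recursively; missing directories are skipped silently. */
    static List<Path> findFontFiles(List<Path> roots) {
        List<Path> found = new ArrayList<>();
        for (Path root : roots) {
            if (!Files.isDirectory(root)) {
                continue;
            }
            try (Stream<Path> files = Files.walk(root)) {
                files.filter(Files::isRegularFile)
                     .filter(FontFileFinder::isFontFile)
                     .sorted()
                     .forEach(found::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Path> roots = Arrays.asList(
            Paths.get("/usr/share/fonts/opentype/noto/"),
            Paths.get("/usr/share/fonts/truetype/noto/"),
            Paths.get("/usr/share/fonts/google-noto/"),
            Paths.get("/usr/local/share/fonts/"));
        // Each path printed here is a candidate for font registration.
        findFontFiles(roots).forEach(System.out::println);
    }
}
```

Logging the result at startup makes it obvious when a container image is missing the font packages it was expected to have.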

Workaround 4: Convert Characters to Images

Approach: Render non-Latin text as images before PDF generation.

Limitations:

  • Text is not searchable or selectable
  • Significantly larger file sizes
  • Quality issues with scaling
  • Complex implementation

Troubleshooting: "My Characters Still Show as Boxes"

If you've registered fonts but characters still don't render:

1. Verify Font Contains Required Characters

// Check if font file contains specific characters
Font font = Font.createFont(Font.TRUETYPE_FONT, new File("NotoSansCJK.ttf"));
boolean hasCharacter = font.canDisplay('\u4e2d'); // Chinese character
System.out.println("Font supports character: " + hasCharacter);

2. Check Font Registration Order

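When several registered fonts contain the same character, the first matching font is typically the one used, so register script-specific fonts before generic ones. With Flying Saucer, also confirm that the CSS font-family applied to the affected text names the registered font's family; a registered font that no style rule references is not picked up: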

// Register CJK font first for Asian character priority
resolver.addFont("NotoSansCJK.ttf", ...);
// Then register Latin font
resolver.addFont("NotoSans.ttf", ...);

3. Verify Encoding Settings

// Must use IDENTITY_H for Unicode
PdfFontFactory.createFont(fontPath, PdfEncodings.IDENTITY_H,
    PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);

4. Check HTML Character Encoding

<!-- Ensure HTML declares UTF-8 -->
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
</head>

A Different Approach: IronPDF

IronPDF uses Chromium's font system, which falls back across the fonts available on the system (and any web fonts loaded via CSS @font-face) to find glyphs for each script.

Why IronPDF Handles Unicode Automatically

Chromium's text rendering:

  1. Automatically detects required character sets
  2. Uses system font fallback chains
  3. Supports CSS @font-face for web fonts
  4. Handles mixed-language content seamlessly

No manual font registration is required.

Code Example

using IronPdf;

public class MultilingualPdfGenerator
{
    public byte[] GenerateMultilingualDocument()
    {
        var renderer = new ChromePdfRenderer();

        // All scripts render automatically - no font configuration needed
        string html = @"
<!DOCTYPE html>
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body {
            font-family: 'Segoe UI', 'Noto Sans', 'Arial Unicode MS', sans-serif;
            font-size: 14px;
            line-height: 1.8;
        }
        h2 { color: #333; margin-top: 30px; }
        .sample {
            padding: 15px;
            background: #f5f5f5;
            border-radius: 8px;
            margin: 10px 0;
        }
    </style>
</head>
<body>
    <h1>Multilingual Document</h1>

    <h2>Chinese (简体中文)</h2>
    <div class='sample'>
        这是一个测试文档。欢迎使用我们的服务。
    </div>

    <h2>Japanese (日本語)</h2>
    <div class='sample'>
        これはテスト文書です。サービスへようこそ。
    </div>

    <h2>Korean (한국어)</h2>
    <div class='sample'>
        이것은 테스트 문서입니다. 서비스를 이용해 주셔서 감사합니다.
    </div>

    <h2>Arabic (العربية)</h2>
    <div class='sample' dir='rtl'>
        هذا مستند اختبار. مرحبا بكم في خدمتنا.
    </div>

    <h2>Hebrew (עברית)</h2>
    <div class='sample' dir='rtl'>
        זהו מסמך בדיקה. ברוכים הבאים לשירות שלנו.
    </div>

    <h2>Russian (Русский)</h2>
    <div class='sample'>
        Это тестовый документ. Добро пожаловать в наш сервис.
    </div>

    <h2>Thai (ภาษาไทย)</h2>
    <div class='sample'>
        นี่คือเอกสารทดสอบ ยินดีต้อนรับสู่บริการของเรา
    </div>

    <h2>Hindi (हिन्दी)</h2>
    <div class='sample'>
        यह एक परीक्षण दस्तावेज़ है। हमारी सेवा में आपका स्वागत है।
    </div>

    <h2>Mixed Content</h2>
    <div class='sample'>
        Welcome 欢迎 ようこそ 환영합니다 مرحبا Добро пожаловать
    </div>
</body>
</html>";

        using var pdf = renderer.RenderHtmlAsPdf(html);
        return pdf.BinaryData;
    }

    public byte[] GenerateInternationalInvoice(InvoiceData invoice)
    {
        var renderer = new ChromePdfRenderer();

        string html = $@"
<!DOCTYPE html>
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body {{ font-family: system-ui, sans-serif; }}
        .address {{ white-space: pre-line; }}
    </style>
</head>
<body>
    <h1>Invoice / 发票 / 請求書</h1>

    <div class='address'>
        {invoice.CustomerName}
        {invoice.Address}
    </div>

    <!-- Address with Chinese/Japanese characters works automatically -->
</body>
</html>";

        using var pdf = renderer.RenderHtmlAsPdf(html);
        return pdf.BinaryData;
    }
}

Key points:

  • UTF-8 HTML with any script works automatically
  • No font registration required
  • Right-to-left languages (Arabic, Hebrew) are handled correctly
  • Mixed-language content renders properly

Additional Language Examples

public byte[] GenerateVietnameseDocument()
{
    var renderer = new ChromePdfRenderer();

    // Vietnamese uses Latin script with extensive diacritics
    string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
    <h1>Tiếng Việt</h1>
    <p>Xin chào! Chào mừng bạn đến với dịch vụ của chúng tôi.</p>
    <p>Đây là một tài liệu thử nghiệm bằng tiếng Việt.</p>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

public byte[] GenerateGreekDocument()
{
    var renderer = new ChromePdfRenderer();

    string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
    <h1>Ελληνικά</h1>
    <p>Γεια σας! Καλώς ήρθατε στην υπηρεσία μας.</p>
    <p>Αυτό είναι ένα δοκιμαστικό έγγραφο στα ελληνικά.</p>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

public byte[] GenerateComplexMixedDocument()
{
    var renderer = new ChromePdfRenderer();

    // Document with multiple scripts and directions
    string html = @"
<html>
<head>
    <meta charset='UTF-8'>
    <style>
        body { font-family: system-ui, sans-serif; }
        .rtl { direction: rtl; text-align: right; }
        table { width: 100%; border-collapse: collapse; }
        td { padding: 10px; border: 1px solid #ccc; }
    </style>
</head>
<body>
    <h1>International Customer List</h1>
    <table>
        <tr>
            <td>Japan</td>
            <td>田中太郎</td>
            <td>東京都港区</td>
        </tr>
        <tr>
            <td>China</td>
            <td>李明</td>
            <td>北京市朝阳区</td>
        </tr>
        <tr>
            <td>UAE</td>
            <td class='rtl'>محمد أحمد</td>
            <td class='rtl'>دبي، الإمارات</td>
        </tr>
        <tr>
            <td>Israel</td>
            <td class='rtl'>יוסי כהן</td>
            <td class='rtl'>תל אביב</td>
        </tr>
        <tr>
            <td>Russia</td>
            <td>Иван Петров</td>
            <td>Москва</td>
        </tr>
        <tr>
            <td>Thailand</td>
            <td>สมชาย ใจดี</td>
            <td>กรุงเทพมหานคร</td>
        </tr>
    </table>
</body>
</html>";

    using var pdf = renderer.RenderHtmlAsPdf(html);
    return pdf.BinaryData;
}

Migration Considerations

Licensing

  • iText has AGPL or commercial licensing
  • IronPDF is commercial with perpetual licensing
  • IronPDF Licensing

What You Gain

  • Automatic Unicode support for all languages
  • No font registration or configuration
  • RTL language support built-in
  • Same results as browser rendering

What to Consider

  • Different library, different API
  • Chromium-based vs custom renderer
  • Commercial licensing

Conclusion

iText's non-Latin character issues require significant manual configuration that is error-prone and hard to maintain across all possible character sets. Browser-based rendering engines handle Unicode automatically through system font fallback, providing reliable international text support without manual font management.


Written by Jacob Mellor, who leads technical development at Iron Software.


References

  1. Stack Overflow: Generation of PDF from HTML with non-Latin characters - 27K+ views

For the latest IronPDF documentation and tutorials, visit ironpdf.com.
