Developers using iText or Flying Saucer (ITextRenderer) to convert HTML to PDF find that non-Latin characters (Chinese, Japanese, Korean, Arabic, Hebrew, Cyrillic, etc.) display as boxes or question marks, or disappear entirely. With over 27,000 views on Stack Overflow, this is one of the most common iText internationalization issues. The problem stems from font embedding requirements that are not immediately obvious. This article documents the issue and examines alternatives with automatic Unicode support.
The Problem
When converting HTML containing non-Latin characters to PDF with iText or Flying Saucer:
- Characters display as empty boxes (□□□□)
- Characters appear as question marks (????)
- Characters are simply missing from output
- Some characters render while others don't
The issue affects:
- Chinese (Simplified and Traditional)
- Japanese (Hiragana, Katakana, Kanji)
- Korean (Hangul)
- Arabic and Hebrew
- Cyrillic (Russian, Ukrainian, etc.)
- Greek
- Thai, Vietnamese, Hindi, and many others
Error Messages and Symptoms
There are typically no error messages - characters simply don't render:
<!-- Input HTML -->
<p>Hello 世界 مرحبا Привет שלום</p>
<!-- Output shows -->
Hello □□ □□□□ □□□□□□ □□□□
Who Is Affected
This issue impacts any application with international content:
Industries: Global e-commerce, international business, education, government, any organization serving non-English speakers.
Content Types: Invoices with international addresses, multilingual documents, user-generated content, translated content.
Scale: 27,642 views on the primary Stack Overflow question indicates widespread impact.
Evidence from the Developer Community
Stack Overflow
| Question | Views | Votes |
|---|---|---|
| Generation of PDF from HTML with non-Latin characters | 27,642 | 18 |
Developer Reports
"I am from Czech Republic, and had same problem with our national symbols!"
— Developer, Stack Overflow, 2012
"Chinese characters are showing as boxes in the generated PDF."
— Developer, Stack Overflow, 2015
"Arabic text appears reversed and characters are disconnected."
— Developer, Stack Overflow, 2018
Root Cause Analysis
The issue stems from fundamental PDF font requirements that iText does not handle automatically.
How PDF Fonts Work
Unlike HTML documents where browsers have access to system fonts and web fonts, PDF files must embed font data:
- PDF Font Embedding: Every font used in a PDF must be embedded within the file itself (or reference one of 14 standard PDF fonts)
- Standard PDF Fonts: The 14 standard fonts (Helvetica, Times, Courier, Symbol, ZapfDingbats, and their variants) cover Latin text, basic punctuation, and symbols, but not CJK, Arabic, Hebrew, or Cyrillic
- Character Coverage: Each font file only contains specific characters - no single font covers all Unicode characters
- Font Subsetting: PDF generators typically embed only the characters used, not entire font files
Why iText Shows Boxes or Question Marks
When iText processes HTML containing non-Latin characters:
- It parses the HTML and encounters Chinese/Arabic/Cyrillic characters
- It looks for a font that contains those characters
- If no suitable font is configured, it falls back to a standard PDF font
- The standard font doesn't have those characters
- The PDF shows replacement characters: boxes (missing glyph) or question marks
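The fallback sequence above can be illustrated with a toy model. This is only a sketch and not how iText is implemented: `GlyphFallbackDemo` and `ToyFont` are hypothetical names, and coverage is approximated per Unicode script, whereas a real font defines coverage per code point.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class GlyphFallbackDemo {

    /** Hypothetical stand-in for a font: a name plus the scripts it has glyphs for. */
    static final class ToyFont {
        final String name;
        final Set<Character.UnicodeScript> scripts;

        ToyFont(String name, Set<Character.UnicodeScript> scripts) {
            this.name = name;
            this.scripts = scripts;
        }

        boolean covers(int codePoint) {
            return scripts.contains(Character.UnicodeScript.of(codePoint));
        }
    }

    /** Replaces each code point that no configured font covers with U+25A1, the empty box. */
    static String render(String text, List<ToyFont> fonts) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            boolean covered = fonts.stream().anyMatch(font -> font.covers(cp));
            out.appendCodePoint(covered ? cp : 0x25A1);
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Models a standard PDF font: Latin letters plus shared punctuation, digits, and spaces
        ToyFont latinOnly = new ToyFont("Helvetica",
                EnumSet.of(Character.UnicodeScript.LATIN, Character.UnicodeScript.COMMON));
        // "Hello" followed by two Han characters (U+4E16 U+754C)
        System.out.println(render("Hello \u4e16\u754c", List.of(latinOnly)));
    }
}
```

With only the Latin font configured, the two Han characters come out as boxes, which mirrors the symptom shown earlier. Adding a second `ToyFont` that covers `HAN` makes them pass through unchanged.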
Character Encoding Is Not the Problem
Many developers mistakenly think this is an encoding issue:
// This does NOT fix the problem
String html = new String(htmlBytes, "UTF-8");
The HTML encoding is typically correct. The problem is that the PDF generator doesn't have a font file containing the required characters.
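One way to rule out encoding is to inspect the decoded string's code points before it reaches the PDF library. The sketch below uses only the JDK; `EncodingCheck` and `toCodePoints` are illustrative names, and the byte array stands in for whatever file or request body the HTML comes from.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

final class EncodingCheck {

    /** Formats each code point as U+XXXX so the decoded text can be inspected. */
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // UTF-8 bytes of two Han characters, as they might arrive from a file or HTTP request
        byte[] htmlBytes = {
            (byte) 0xE4, (byte) 0xB8, (byte) 0x96,
            (byte) 0xE7, (byte) 0x95, (byte) 0x8C
        };
        String decoded = new String(htmlBytes, StandardCharsets.UTF_8);
        System.out.println(toCodePoints(decoded));
    }
}
```

If the output lists the expected code points (here U+4E16 and U+754C) rather than U+FFFD or a run of Latin-1 characters, the string is intact and the missing glyphs are a font problem.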
Right-to-Left Language Complexity
Arabic and Hebrew present additional challenges:
- Text Direction: Characters must render right-to-left
- Character Shaping: Arabic letters change form based on position (initial, medial, final, isolated)
- Ligatures: Certain character combinations should form connected shapes
- Mixed Direction: Numbers within RTL text require bidirectional handling
iText requires specific configuration to handle these correctly, beyond just font embedding.
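Before configuring anything, it can help to know whether a given string needs bidirectional handling at all. The sketch below uses the JDK's `java.text.Bidi` class for that check; `DirectionCheck` and `classify` are illustrative names. It only reports direction and does not perform Arabic shaping or ligature substitution, which the PDF library still has to handle.

```java
import java.text.Bidi;

final class DirectionCheck {

    /** Returns "LTR", "RTL", or "MIXED" according to the Unicode bidirectional algorithm. */
    static String classify(String text) {
        // The base direction is taken from the first strongly directional character
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isLeftToRight()) {
            return "LTR";
        }
        if (bidi.isRightToLeft()) {
            return "RTL";
        }
        return "MIXED";
    }

    public static void main(String[] args) {
        System.out.println(classify("Hello"));
        // Hebrew word "shalom"
        System.out.println(classify("\u05E9\u05DC\u05D5\u05DD"));
        // Latin text, a Hebrew word, and digits in one line
        System.out.println(classify("Invoice \u05E9\u05DC\u05D5\u05DD 123"));
    }
}
```

Strings classified as `RTL` or `MIXED` are the ones that need the extra configuration described above.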
Attempted Workarounds
Workaround 1: Manual Font Registration
Approach: Register fonts that contain the required characters.
// Flying Saucer approach
ITextRenderer renderer = new ITextRenderer();
renderer.getFontResolver().addFont(
"/path/to/NotoSansCJK.ttf",
BaseFont.IDENTITY_H,
BaseFont.EMBEDDED
);
// iText 7 approach
PdfFont font = PdfFontFactory.createFont(
"path/to/font.ttf",
PdfEncodings.IDENTITY_H,
PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED
);
Limitations:
- Must know which fonts to include for each language
- Fonts must be available on the server
- Different fonts needed for different character sets
- Complex font fallback chains required for mixed content
- Licensing issues with some fonts
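Knowing which fonts to register starts with knowing which scripts the content contains. A minimal sketch using the JDK's `Character.UnicodeScript` follows; `ScriptDetector` and `scriptsIn` are hypothetical names, and mapping each detected script to an actual font file is left to the caller.

```java
import java.util.EnumSet;
import java.util.Set;

final class ScriptDetector {

    /** Collects the Unicode scripts used in the text, ignoring shared punctuation, digits, and marks. */
    static Set<Character.UnicodeScript> scriptsIn(String text) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        text.codePoints()
            .mapToObj(Character.UnicodeScript::of)
            .filter(script -> script != Character.UnicodeScript.COMMON
                           && script != Character.UnicodeScript.INHERITED
                           && script != Character.UnicodeScript.UNKNOWN)
            .forEach(scripts::add);
        return scripts;
    }

    public static void main(String[] args) {
        // Latin, Han (U+4E16 U+754C), and a Cyrillic greeting
        String sample = "Hello \u4e16\u754c \u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(scriptsIn(sample));
    }
}
```

Running this over representative content (templates, sample user data) gives a list of scripts to find fonts for, instead of guessing per language.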
Workaround 2: Use Google Noto Fonts
Approach: Use Google's Noto font family which covers most Unicode characters.
// Register multiple Noto fonts for different scripts
ITextFontResolver fontResolver = renderer.getFontResolver();
// CJK (Chinese, Japanese, Korean)
fontResolver.addFont("fonts/NotoSansCJKsc-Regular.otf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Arabic
fontResolver.addFont("fonts/NotoSansArabic-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Hebrew
fontResolver.addFont("fonts/NotoSansHebrew-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Cyrillic
fontResolver.addFont("fonts/NotoSans-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Thai
fontResolver.addFont("fonts/NotoSansThai-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Devanagari (Hindi)
fontResolver.addFont("fonts/NotoSansDevanagari-Regular.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
// Vietnamese (extended Latin) is covered by NotoSans-Regular.ttf, already registered above
Font Files Required for Full Coverage:
| Script | Noto Font File | Size |
|---|---|---|
| Latin Extended | NotoSans-Regular.ttf | ~500KB |
| CJK (all) | NotoSansCJK-Regular.ttc | ~120MB |
| Arabic | NotoSansArabic-Regular.ttf | ~200KB |
| Hebrew | NotoSansHebrew-Regular.ttf | ~100KB |
| Thai | NotoSansThai-Regular.ttf | ~150KB |
| Devanagari | NotoSansDevanagari-Regular.ttf | ~200KB |
| Greek | NotoSans-Regular.ttf | (included) |
| Cyrillic | NotoSans-Regular.ttf | (included) |
Limitations:
- Noto font collection is ~1GB for full coverage
- Must register correct font for each script in your content
- Font fallback logic still required for mixed content
- Significant setup and deployment complexity
- Docker/container deployments need fonts bundled
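The fallback logic for mixed content amounts to splitting text into runs that can each be drawn with a single font. The sketch below shows one way to do that with the JDK alone. It is an illustration, not part of Flying Saucer or iText: `FontRunSplitter` is a hypothetical name, the file names are taken from the table above, and the script-to-font mapping is deliberately incomplete.

```java
import java.util.ArrayList;
import java.util.List;

final class FontRunSplitter {

    /** Maps a code point to a font file from the table above, or null for script-neutral characters. */
    static String fontFor(int codePoint) {
        switch (Character.UnicodeScript.of(codePoint)) {
            case HAN: case HIRAGANA: case KATAKANA: case HANGUL:
                return "NotoSansCJK-Regular.ttc";
            case ARABIC:
                return "NotoSansArabic-Regular.ttf";
            case HEBREW:
                return "NotoSansHebrew-Regular.ttf";
            case THAI:
                return "NotoSansThai-Regular.ttf";
            case DEVANAGARI:
                return "NotoSansDevanagari-Regular.ttf";
            case LATIN: case GREEK: case CYRILLIC:
                return "NotoSans-Regular.ttf";
            default:
                return null; // spaces, digits, punctuation: keep them in the current run
        }
    }

    /** Splits text into "font|text" runs; each run needs only one font. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentFont = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String font = fontFor(cp);
            if (font != null && currentFont != null && !font.equals(currentFont)) {
                runs.add(currentFont + "|" + current); // font changes: close the current run
                current.setLength(0);
            }
            if (font != null) {
                currentFont = font;
            }
            current.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            // Text with no script-specific characters defaults to the Latin font
            runs.add((currentFont == null ? "NotoSans-Regular.ttf" : currentFont) + "|" + current);
        }
        return runs;
    }

    public static void main(String[] args) {
        // "Hi " followed by two Han characters (U+4E16 U+754C)
        split("Hi \u4e16\u754c").forEach(System.out::println);
    }
}
```

For the sample input this yields two runs, one for the Latin font and one for the CJK font. The same run boundaries are where an HTML template would need a different `font-family`.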
Workaround 3: Docker and Linux Font Configuration
When running in Docker or Linux environments:
# Dockerfile - Install fonts for PDF generation
FROM openjdk:11-jre-slim
# Install font packages
RUN apt-get update && apt-get install -y \
fonts-noto-cjk \
fonts-noto-core \
fonts-noto-extra \
fonts-freefont-ttf \
&& rm -rf /var/lib/apt/lists/*
# Copy application
COPY target/app.jar /app/app.jar
WORKDIR /app
CMD ["java", "-jar", "app.jar"]
Linux Font Paths:
// Linux font locations to check
String[] fontPaths = {
"/usr/share/fonts/opentype/noto/",
"/usr/share/fonts/truetype/noto/",
"/usr/share/fonts/google-noto/",
"/usr/local/share/fonts/"
};
for (String path : fontPaths) {
File dir = new File(path);
if (dir.exists()) {
// Register fonts from this directory
}
}
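The placeholder comment in the loop above could be filled in along the following lines. This is a sketch using `java.nio.file` only; `FontFileFinder` is a hypothetical helper, and passing each returned path to the PDF library's font registration call is left to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

final class FontFileFinder {

    /** Recursively collects .ttf, .otf, and .ttc files under those directories that exist. */
    static List<Path> findFontFiles(List<Path> directories) {
        List<Path> fonts = new ArrayList<>();
        for (Path dir : directories) {
            if (!Files.isDirectory(dir)) {
                continue; // this location is absent on the current distribution
            }
            try (Stream<Path> files = Files.walk(dir)) {
                files.filter(Files::isRegularFile)
                     .filter(FontFileFinder::isFontFile)
                     .sorted()
                     .forEach(fonts::add);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return fonts;
    }

    /** Matches on file extension only; it does not validate the font data. */
    static boolean isFontFile(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf") || name.endsWith(".ttc");
    }

    public static void main(String[] args) {
        List<Path> found = findFontFiles(List.of(
                Paths.get("/usr/share/fonts/opentype/noto/"),
                Paths.get("/usr/share/fonts/truetype/noto/"),
                Paths.get("/usr/share/fonts/google-noto/"),
                Paths.get("/usr/local/share/fonts/")));
        found.forEach(System.out::println);
    }
}
```

Logging the result at startup makes it obvious when a container image is missing the expected font packages.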
Limitations:
- Adds significant container size (Noto CJK fonts are 100MB+)
- Font paths vary between Linux distributions
- Must configure iText to find installed fonts
Workaround 4: Convert Characters to Images
Approach: Render non-Latin text as images before PDF generation.
Limitations:
- Text is not searchable or selectable
- Significantly larger file sizes
- Quality issues with scaling
- Complex implementation
Troubleshooting: "My Characters Still Show as Boxes"
If you've registered fonts but characters still don't render:
1. Verify Font Contains Required Characters
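// Check whether a font file contains a specific character.
// Uses java.awt.Font and java.io.File; Font.createFont throws FontFormatException and IOException.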
Font font = Font.createFont(Font.TRUETYPE_FONT, new File("NotoSansCJK.ttf"));
boolean hasCharacter = font.canDisplay('\u4e2d'); // Chinese character
System.out.println("Font supports character: " + hasCharacter);
2. Check Font Registration Order
Font selection can depend on registration order, and with Flying Saucer the CSS font-family must also name the registered font. Register script-specific fonts before generic ones:
// Register CJK font first for Asian character priority
resolver.addFont("NotoSansCJK.ttf", ...);
// Then register Latin font
resolver.addFont("NotoSans.ttf", ...);
3. Verify Encoding Settings
// Must use IDENTITY_H for Unicode
PdfFontFactory.createFont(fontPath, PdfEncodings.IDENTITY_H,
PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);
4. Check HTML Character Encoding
<!-- Ensure HTML declares UTF-8 -->
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
A Different Approach: IronPDF
IronPDF uses Chromium's font system, which handles font fallback automatically for any script covered by a font installed on the host system.
Why IronPDF Handles Unicode Automatically
Chromium's text rendering:
- Automatically detects required character sets
- Uses system font fallback chains
- Supports CSS @font-face for web fonts
- Handles mixed-language content seamlessly
No manual font registration is required.
Code Example
using IronPdf;
public class MultilingualPdfGenerator
{
public byte[] GenerateMultilingualDocument()
{
var renderer = new ChromePdfRenderer();
// All scripts render automatically - no font configuration needed
string html = @"
<!DOCTYPE html>
<html>
<head>
<meta charset='UTF-8'>
<style>
body {
font-family: 'Segoe UI', 'Noto Sans', 'Arial Unicode MS', sans-serif;
font-size: 14px;
line-height: 1.8;
}
h2 { color: #333; margin-top: 30px; }
.sample {
padding: 15px;
background: #f5f5f5;
border-radius: 8px;
margin: 10px 0;
}
</style>
</head>
<body>
<h1>Multilingual Document</h1>
<h2>Chinese (简体中文)</h2>
<div class='sample'>
这是一个测试文档。欢迎使用我们的服务。
</div>
<h2>Japanese (日本語)</h2>
<div class='sample'>
これはテスト文書です。サービスへようこそ。
</div>
<h2>Korean (한국어)</h2>
<div class='sample'>
이것은 테스트 문서입니다. 서비스를 이용해 주셔서 감사합니다.
</div>
<h2>Arabic (العربية)</h2>
<div class='sample' dir='rtl'>
هذا مستند اختبار. مرحبا بكم في خدمتنا.
</div>
<h2>Hebrew (עברית)</h2>
<div class='sample' dir='rtl'>
זהו מסמך בדיקה. ברוכים הבאים לשירות שלנו.
</div>
<h2>Russian (Русский)</h2>
<div class='sample'>
Это тестовый документ. Добро пожаловать в наш сервис.
</div>
<h2>Thai (ภาษาไทย)</h2>
<div class='sample'>
นี่คือเอกสารทดสอบ ยินดีต้อนรับสู่บริการของเรา
</div>
<h2>Hindi (हिन्दी)</h2>
<div class='sample'>
यह एक परीक्षण दस्तावेज़ है। हमारी सेवा में आपका स्वागत है।
</div>
<h2>Mixed Content</h2>
<div class='sample'>
Welcome 欢迎 ようこそ 환영합니다 مرحبا Добро пожаловать
</div>
</body>
</html>";
using var pdf = renderer.RenderHtmlAsPdf(html);
return pdf.BinaryData;
}
public byte[] GenerateInternationalInvoice(InvoiceData invoice)
{
var renderer = new ChromePdfRenderer();
string html = $@"
<!DOCTYPE html>
<html>
<head>
<meta charset='UTF-8'>
<style>
body {{ font-family: system-ui, sans-serif; }}
.address {{ white-space: pre-line; }}
</style>
</head>
<body>
<h1>Invoice / 发票 / 請求書</h1>
<div class='address'>
{invoice.CustomerName}
{invoice.Address}
</div>
<!-- Address with Chinese/Japanese characters works automatically -->
</body>
</html>";
using var pdf = renderer.RenderHtmlAsPdf(html);
return pdf.BinaryData;
}
}
Key points:
- UTF-8 HTML with any script works automatically
- No font registration required
- Right-to-left languages (Arabic, Hebrew) are handled correctly
- Mixed-language content renders properly
Additional Language Examples
public byte[] GenerateVietnameseDocument()
{
var renderer = new ChromePdfRenderer();
// Vietnamese uses Latin script with extensive diacritics
string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
<h1>Tiếng Việt</h1>
<p>Xin chào! Chào mừng bạn đến với dịch vụ của chúng tôi.</p>
<p>Đây là một tài liệu thử nghiệm bằng tiếng Việt.</p>
</body>
</html>";
using var pdf = renderer.RenderHtmlAsPdf(html);
return pdf.BinaryData;
}
public byte[] GenerateGreekDocument()
{
var renderer = new ChromePdfRenderer();
string html = @"
<html>
<head><meta charset='UTF-8'></head>
<body>
<h1>Ελληνικά</h1>
<p>Γεια σας! Καλώς ήρθατε στην υπηρεσία μας.</p>
<p>Αυτό είναι ένα δοκιμαστικό έγγραφο στα ελληνικά.</p>
</body>
</html>";
using var pdf = renderer.RenderHtmlAsPdf(html);
return pdf.BinaryData;
}
public byte[] GenerateComplexMixedDocument()
{
var renderer = new ChromePdfRenderer();
// Document with multiple scripts and directions
string html = @"
<html>
<head>
<meta charset='UTF-8'>
<style>
body { font-family: system-ui, sans-serif; }
.rtl { direction: rtl; text-align: right; }
table { width: 100%; border-collapse: collapse; }
td { padding: 10px; border: 1px solid #ccc; }
</style>
</head>
<body>
<h1>International Customer List</h1>
<table>
<tr>
<td>Japan</td>
<td>田中太郎</td>
<td>東京都港区</td>
</tr>
<tr>
<td>China</td>
<td>李明</td>
<td>北京市朝阳区</td>
</tr>
<tr>
<td>UAE</td>
<td class='rtl'>محمد أحمد</td>
<td class='rtl'>دبي، الإمارات</td>
</tr>
<tr>
<td>Israel</td>
<td class='rtl'>יוסי כהן</td>
<td class='rtl'>תל אביב</td>
</tr>
<tr>
<td>Russia</td>
<td>Иван Петров</td>
<td>Москва</td>
</tr>
<tr>
<td>Thailand</td>
<td>สมชาย ใจดี</td>
<td>กรุงเทพมหานคร</td>
</tr>
</table>
</body>
</html>";
using var pdf = renderer.RenderHtmlAsPdf(html);
return pdf.BinaryData;
}
Migration Considerations
Licensing
- iText has AGPL or commercial licensing
- IronPDF is commercial with perpetual licensing
- IronPDF Licensing
What You Gain
- Automatic Unicode support for all languages
- No font registration or configuration
- RTL language support built-in
- Same results as browser rendering
What to Consider
- Different library, different API
- Chromium-based vs custom renderer
- Commercial licensing
Conclusion
iText's non-Latin character issues require significant manual configuration that is error-prone and hard to maintain across all possible character sets. Browser-based rendering engines handle Unicode automatically through system font fallback, providing reliable international text support without manual font management.
Written by Jacob Mellor, who leads technical development at Iron Software.
References
- Stack Overflow: Generation of PDF from HTML with non-Latin characters - 27K+ views
For the latest IronPDF documentation and tutorials, visit ironpdf.com.