Markus

Posted on Sep 14 • Originally published at the-main-thread.com on Sep 7

Mastering Unicode in Java: Build World-Ready REST APIs with Quarkus

#java #quarkus #unicode

Most Java developers have typed String name = "Hello"; more times than they can count. It works. No surprises. But the illusion of simplicity breaks the moment "こんにちは", "浩宇", or "😉" shows up in your system. Suddenly, that simple String reveals a universe of complexity. Bugs creep in. Data gets corrupted. Users complain their names aren’t stored correctly.

This tutorial will demystify Unicode and show you how to build robust, world-ready Java applications. We’ll cover the theory, expose the gotchas, and then build a “Global Greeting Service” with Quarkus that survives the chaos of real-world text.

By the end, you’ll understand:

Unicode fundamentals and how Java really stores text
Why string length and iteration are trickier than they look
How normalization prevents nasty mismatches
How to configure your REST service and database for safe Unicode handling

Let’s get started.

Unicode Fundamentals: The Bedrock of Modern Text

Unicode is often misunderstood. It’s not an encoding like UTF-8 or UTF-16. It’s a standard. A giant dictionary that assigns a unique number (a code point) to every character and emoji.

The letter “A” → U+0041
The winking face 😉 → U+1F609

Encodings (UTF-8, UTF-16) decide how to store these numbers as bytes. Java's String API exposes text as UTF-16 code units, which introduces some subtle traps.
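To make the difference between a code point and its encoded form concrete, here is a minimal sketch that measures the two characters above in UTF-8 and UTF-16. The class and method names are made up for illustration and are not part of the service we build later.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; the class name is hypothetical.
final class EncodingSizes {

    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        String wink = Character.toString(0x1F609); // 😉
        System.out.println(utf8Bytes("A"));    // 1
        System.out.println(utf8Bytes(wink));   // 4
        System.out.println(utf16Units(wink));  // 2 (a surrogate pair)
        // The two UTF-16 code units of the pair, printed in hex
        System.out.printf("%04X %04X%n", (int) wink.charAt(0), (int) wink.charAt(1)); // D83D DE09
    }
}
```

The code point is the same in every case (U+1F609); only its stored form changes with the encoding.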

Code Points vs. Code Units vs. Grapheme Clusters

Think of three levels:

Code point: The abstract number from Unicode (U+1F48B = 💋).
Code unit: How encodings represent those points in memory. UTF-16 uses 16-bit units, sometimes one, sometimes two.
Grapheme cluster: What humans see as “a single character.” Could be one code point or several combined (e.g., “e” + combining accent).

Example: "a🚀c"

Grapheme clusters: 3 (a, 🚀, c)
Code points: 3 (U+0061, U+1F680, U+0063)
UTF-16 code units: 4 (🚀 takes two units as a surrogate pair)

This explains why string.length() often lies to you.
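The three counts for "a🚀c" can be checked with a short sketch. It assumes java.text.BreakIterator as a stand-in for grapheme cluster counting; that works for this simple string, but how well it handles complex emoji sequences depends on the JDK version. The class name is hypothetical.

```java
import java.text.BreakIterator;

// Illustrative helper; the class name is hypothetical.
final class TextCounts {

    // UTF-16 code units: what String.length() returns.
    static int codeUnits(String s) {
        return s.length();
    }

    // Unicode code points: a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Approximate grapheme clusters by counting character boundaries.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE80c"; // "a🚀c" written with escapes
        System.out.println(codeUnits(text));  // 4
        System.out.println(codePoints(text)); // 3
        System.out.println(graphemes(text));  // 3
    }
}
```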

Normalization Matters

The same visual character can have multiple representations:

"é" = U+00E9 (precomposed)
"e" + combining acute accent = U+0065 + U+0301 (decomposed)

Without normalization, "café" might not equal "café", because two strings that look identical can contain different code points. Normalization (usually NFC) ensures consistent storage and comparison.
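Here is a minimal sketch of that mismatch using java.text.Normalizer from the JDK. The two spellings of "café" are written with Unicode escapes so the difference is visible in the source; the helper name equalsNfc is an assumption for illustration.

```java
import java.text.Normalizer;

// Illustrative helper; the class and method names are hypothetical.
final class NfcCompare {

    // Compares two strings after bringing both to NFC.
    static boolean equalsNfc(String a, String b) {
        String left = Normalizer.normalize(a, Normalizer.Form.NFC);
        String right = Normalizer.normalize(b, Normalizer.Form.NFC);
        return left.equals(right);
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00E9";  // é as a single code point
        String decomposed = "cafe\u0301";  // e followed by a combining acute accent
        System.out.println(precomposed.equals(decomposed));     // false
        System.out.println(equalsNfc(precomposed, decomposed)); // true
    }
}
```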

Building the "Global Greeting Service" with Quarkus

Let's put theory into practice. We'll build a simple REST API that stores and retrieves greetings.

Project Setup

You'll need Java (17+), Maven, Podman, and a terminal.

Generate the Quarkus Project:

mvn io.quarkus.platform:quarkus-maven-plugin:create \
    -DprojectGroupId=org.acme \
    -DprojectArtifactId=unicode-greetings \
    -DclassName="org.acme.GreetingResource" \
    -Dpath="/greetings" \
    -Dextensions="rest-jackson,quarkus-hibernate-orm-panache,quarkus-jdbc-postgresql"

cd unicode-greetings

Delete the generated tests under src/test. (I know 🙄.)

Create the Greeting Entity:

Rename the generated MyEntity.java to src/main/java/org/acme/Greeting.java and replace its contents with the following:

package org.acme;

import io.quarkus.hibernate.orm.panache.PanacheEntity;
import jakarta.persistence.Entity;

@Entity
public class Greeting extends PanacheEntity {
    public String name;
    public String message;
}

Update the GreetingResource:

Replace the contents of src/main/java/org/acme/GreetingResource.java:

package org.acme;

import java.util.List;

import jakarta.transaction.Transactional;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/greetings")
@Produces(MediaType.APPLICATION_JSON)
@Consumes(MediaType.APPLICATION_JSON)
public class GreetingResource {

    @GET
    public List<Greeting> getAll() {
        return Greeting.listAll();
    }

    @POST
    @Transactional
    public Response add(Greeting greeting) {
        greeting.persist();
        return Response.status(Response.Status.CREATED).entity(greeting).build();
    }
}

Configure the Database:

Update src/main/resources/application.properties for a local PostgreSQL database.

# Database configuration
quarkus.datasource.db-kind=postgresql

# Important for Unicode! Ensure the client connection talks UTF-8.
quarkus.datasource.jdbc.additional-jdbc-properties.charSet=UTF-8

# Drop and create the schema on startup for development
quarkus.hibernate-orm.schema-management.strategy=drop-and-create

Start the Application:

./mvnw quarkus:dev

Quarkus will automatically start a PostgreSQL container for you.

You now have a basic REST service. Let's start breaking it with Unicode.

Java-Specific Challenges: The Gotchas Appear

Our simple service works fine for ASCII. Now let's introduce a name with an emoji and see what happens.

The `String.length()` Lie

Let's add a "safety check" to our resource to prevent overly long names.

Modify the add method in GreetingResource.java:

// In GreetingResource.java
@POST
@Transactional
public Response add(Greeting greeting) {
    // A seemingly innocent validation check
    if (greeting.name != null && greeting.name.length() > 6) {
        return Response.status(Response.Status.BAD_REQUEST)
                .entity("{\"error\":\"Name cannot exceed 6 characters\"}")
                .build();
    }
    greeting.persist();
    return Response.status(Response.Status.CREATED).entity(greeting).build();
}

Now, try to post a greeting using curl:

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Team 🚀", "message": "To the moon!" }'

Result: You get a 400 Bad Request!

{"error":"Name cannot exceed 6 characters"}

But "Team 🚀" looks like 6 characters. What gives? As we learned, the rocket emoji requires a surrogate pair in UTF-16. So greeting.name.length() returns 7 (T-e-a-m- -🚀[part1]-🚀[part2]), which is greater than 6. Oops.

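The Fix: Use codePointCount() to count code points instead of UTF-16 code units. This is much closer to what users perceive as "characters", although a visible character built from several code points (a base letter plus a combining accent, for example) still counts more than once.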

// In GreetingResource.java
// ...
if (greeting.name != null && greeting.name.codePointCount(0, greeting.name.length()) > 6) {
// ...

Update the code and try the curl command again. Success! The greeting is created.
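Counting code points fixes the surrogate pair problem, but it is worth knowing where it stops helping. The sketch below wraps the check in a hypothetical withinLimit helper and then applies it to a family emoji, which is one visible symbol built from three emoji joined by two zero-width joiners.

```java
// Illustrative helper; the class and method names are hypothetical.
final class NameLimit {

    // True when the name has at most maxCodePoints Unicode code points.
    static boolean withinLimit(String name, int maxCodePoints) {
        return name.codePointCount(0, name.length()) <= maxCodePoints;
    }

    public static void main(String[] args) {
        String team = "Team \uD83D\uDE80"; // "Team 🚀"
        System.out.println(team.length());        // 7
        System.out.println(withinLimit(team, 6)); // true

        // Man + ZWJ + woman + ZWJ + girl: rendered as a single family emoji
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";
        System.out.println(family.length());                            // 8
        System.out.println(family.codePointCount(0, family.length()));  // 5
    }
}
```

If a limit has to match what users see on screen, counting grapheme clusters is the stricter option; for a storage-oriented limit, code points are usually a reasonable compromise.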

Iterating Correctly

Another common mistake is iterating over a String's char array. Let's imagine we want to create a slug from a name by filtering characters.

// Don't do this! This is a demonstration of what NOT to do.
public static String createSlug(String input) {
    StringBuilder slug = new StringBuilder();
    for (char c : input.toCharArray()) {
        if (Character.isLetterOrDigit(c)) {
            slug.append(Character.toLowerCase(c));
        }
    }
    return slug.toString();
}

// In some test method:
System.out.println(createSlug("User-👍-Name"));
// Output: "username"
// This looks right only because 👍 is not a letter or digit and would be dropped anyway.

When the loop encounters the 👍 emoji, it processes each half of the surrogate pair separately. Character.isLetterOrDigit(char) returns false for both halves, so they are skipped. That happens to be harmless here, but a letter outside the Basic Multilingual Plane, such as 𠮷 (U+20BB7), is silently dropped for the same reason. Other char-based operations, such as truncating or reversing, can leave you with half a surrogate pair, which is corrupt data.

The Fix: Use the codePoints() stream. This correctly presents each code point, regardless of whether it's one or two code units.

public static String createSlugProperly(String input) {
    StringBuilder slug = new StringBuilder();
    input.codePoints().forEach(codePoint -> {
        if (Character.isLetterOrDigit(codePoint)) {
            // Lowercase the whole code point, then append it (one or two chars)
            slug.appendCodePoint(Character.toLowerCase(codePoint));
        }
    });
    return slug.toString();
}

This version is safe for supplementary characters: every code point, whether it takes one or two UTF-16 code units, is tested and lowercased as a whole.
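To see the two approaches actually disagree, we need a letter that lives outside the Basic Multilingual Plane. The sketch below uses 𠮷 (U+20BB7), a CJK ideograph stored as a surrogate pair, and puts both strategies side by side. The class and method names are hypothetical.

```java
// Illustrative comparison; the class and method names are hypothetical.
final class SlugDemo {

    // Char-based version: sees each surrogate half on its own.
    static String slugByChar(String input) {
        StringBuilder slug = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                slug.append(Character.toLowerCase(c));
            }
        }
        return slug.toString();
    }

    // Code-point-based version: sees each character as a whole.
    static String slugByCodePoint(String input) {
        StringBuilder slug = new StringBuilder();
        input.codePoints()
                .filter(Character::isLetterOrDigit)
                .map(Character::toLowerCase)
                .forEach(slug::appendCodePoint);
        return slug.toString();
    }

    public static void main(String[] args) {
        String name = "\uD842\uDFB7-Ab"; // "𠮷-Ab"
        System.out.println(slugByChar(name).length());      // 2: the ideograph was dropped
        System.out.println(slugByCodePoint(name).length()); // 4: surrogate pair plus "ab"
    }
}
```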

Web Development Pain Points

Now let's tackle problems that arise when our service interacts with other systems, like databases and clients.

The Normalization Search Problem

Add a new greeting:

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "José", "message": "Hola!" }'

The name is stored with the precomposed character é (U+00E9). Now, imagine a user with a keyboard that produces the letter e followed by a combining accent ´ searches for "José". Their search term is byte-for-byte different.

Let's add a search endpoint to see this fail.

In GreetingResource.java (this also needs import jakarta.ws.rs.QueryParam):

    @GET
    @Path("/search")
    public Response search(@QueryParam("name") String name) {
        if (name == null) {
            return Response.ok(List.of()).build();
        }
        // This is a naive, direct comparison that will fail
        List<Greeting> results = Greeting.list("name", name);
        return Response.ok(results).build();
    }

Now, try to search for "Jose".

curl "http://localhost:8080/greetings/search?name=Jose"

As expected, it returns an empty result, because "Jose" and "José" are different strings. But a search for "José" typed with a combining accent would fail too, even though it looks identical to the stored name.

The Fix: Normalize all strings to a consistent form. We'll use NFC.

Modify the add method: Normalize the name before saving it.

// At the top of GreetingResource.java
import java.text.Normalizer;
// ...
// In add(), before the validation check and persist (Normalizer rejects null input)
if (greeting.name != null) {
    greeting.name = Normalizer.normalize(greeting.name, Normalizer.Form.NFC);
}

Modify the search method: Normalize the search query before looking it up.

// In GreetingResource.java's search() method
    @GET
    @Path("/search")
    public Response search(@QueryParam("name") String name) {
        if (name == null) {
            return Response.ok(List.of()).build();
        }
        String normalizedName = Normalizer.normalize(name, Normalizer.Form.NFC);
        // Exact match against the NFC-normalized query
        List<Greeting> results = Greeting.list("name", normalizedName);
        return Response.ok(results).build();
    }

Now, regardless of how "José" is typed, it is converted to the same canonical form. And yet a search for "José" can still fail. Why? Welcome to the world of character encodings.

The problem is how the query string is encoded in the URL:

Stored in DB : José (bytes: [74, 111, 115, -61, -87])
Received from query : JosÃ© (bytes: [74, 111, 115, -61, -125, -62, -87])

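The é is effectively double-encoded on its way to the server. When curl is given a raw é in the URL, it sends the UTF-8 bytes C3 A9 without percent-encoding them, and the server appears to treat each byte as a separate character, producing Ã©. Encoded as UTF-8 again, those two characters become the four bytes shown above.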

URL encoding (also called percent-encoding) converts special characters that aren't safe for URLs into a format that can be safely transmitted. For example, the letter "é" becomes "%C3%A9" because it's represented as two bytes (C3 A9 in hexadecimal) in UTF-8 encoding, and each byte is prefixed with a percent sign. This ensures that characters like spaces, accented letters, and symbols don't interfere with URL parsing or cause issues when transmitted across different systems that might handle character encoding differently.
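The sketch below shows both halves of the story with JDK classes only. The first two helpers percent-encode and decode a query value as UTF-8. The third reproduces the garbled bytes from above by reading UTF-8 bytes as ISO-8859-1; that decoding step is an assumption about what happens on the receiving side, chosen because it yields exactly the byte pattern shown. All names are hypothetical.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative helper; the class and method names are hypothetical.
final class QueryEncoding {

    // Percent-encodes a query value as UTF-8 (form style: a space becomes '+').
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Decodes a percent-encoded query value as UTF-8.
    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    // Simulates the bug: UTF-8 bytes read as if each byte were a Latin-1 character.
    static String misreadAsLatin1(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "Jos\u00E9"; // "José"
        System.out.println(encode(name)); // Jos%C3%A9
        byte[] garbled = misreadAsLatin1(name).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(garbled)); // [74, 111, 115, -61, -125, -62, -87]
    }
}
```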

You can either fix the curl command:

curl "http://localhost:8080/greetings/search?name=Jos%C3%A9"

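Or make the search method more permissive, with a case-insensitive match and a substring fallback. Keep in mind that this only widens what counts as a match. It cannot repair a query that arrived mis-decoded, so percent-encoding the request remains the real fix.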

    @GET
    @Path("/search")
    public Response search(@QueryParam("name") String name) {
        if (name == null) {
            return Response.ok(List.of()).build();
        }

        // Normalize the query to NFC so it can match names stored in NFC
        String normalizedName = Normalizer.normalize(name, Normalizer.Form.NFC);

        // Try a case-insensitive exact match first
        List<Greeting> results = Greeting.find("LOWER(name) = LOWER(?1)", normalizedName).list();

        // If nothing matches, fall back to a broad substring search
        if (results.isEmpty()) {
            results = Greeting.find("name LIKE ?1", "%" + normalizedName + "%").list();
        }
        return Response.ok(results).build();
    }

Note: A truly user-friendly search would also be case-insensitive and might even strip accents (e.g., so "Jose" finds "José"). This often requires using like in the database query and a separate library for accent stripping, but normalization is the essential first step. We only use a very broad, permissive search as fallback here.
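For simple cases, accent stripping can be sketched with the JDK alone: decompose to NFD, remove the combining marks, and lowercase. This is an illustration under that assumption and not what the service above does. The toSearchKey name is hypothetical, and letters that have no decomposition (such as ø or ł) pass through unchanged.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative helper; the class and method names are hypothetical.
final class SearchKeys {

    // Matches combining marks, which NFD separates from their base letters.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    // Builds an accent-free, lowercase key for loose matching.
    static String toSearchKey(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String withoutMarks = MARKS.matcher(decomposed).replaceAll("");
        return withoutMarks.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toSearchKey("Jos\u00E9"));          // jose
        System.out.println(toSearchKey("Jose\u0301"));         // jose
        System.out.println(toSearchKey("\u00C5ngstr\u00F6m")); // angstrom
    }
}
```

A search could store such a key in a separate column next to the original name and compare keys instead of names. Whether folding accents is acceptable depends on the language: in Swedish, for instance, "å" and "a" are different letters.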

Sorting with `Collator`

Let's add a feature to get a sorted list of greetings.

In GreetingResource.java:

@GET
@Path("/sorted")
public List<Greeting> getSortedByName() {
    return Greeting.list("order by name");
}

Now add these three names to your service: "Zebra", "Ångström", "Aaron".

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Zebra"}'

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Ångström"}'

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json" \
-d '{ "name": "Aaron"}'

When you call

curl "http://localhost:8080/greetings/sorted"

you'll likely get this order:

Aaron
Ångström
Zebra

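This is not a plain byte-value sort, which would put "Å" (U+00C5) after "Z". It is the order produced by the database's default collation, which most likely follows English-style rules and treats "Å" as an accented "A". In Swedish, however, "Å" is the 27th letter of the alphabet, so "Ångström" should come after "Zebra".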

The Fix: For language-sensitive sorting, use java.text.Collator. The database applies the same collation to every request, so to honor each caller's locale we sort in the Java application code.

// In GreetingResource.java
import java.text.Collator;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

import jakarta.ws.rs.DefaultValue;
import jakarta.ws.rs.QueryParam;

// ...
    @GET
    @Path("/sorted")
    public List<Greeting> getSortedByName(@QueryParam("locale") @DefaultValue("en-US") String localeTag) {
        List<Greeting> greetings = Greeting.listAll();
        Locale locale = Locale.forLanguageTag(localeTag);
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY); // Makes it case-insensitive too

        greetings.sort(Comparator.comparing(g -> g.name, collator));
        return greetings;
    }

Now call the endpoint with a Swedish locale:

curl "http://localhost:8080/greetings/sorted?locale=sv-SE"

You will get the culturally expected order:

Aaron
Zebra
Ångström
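If you want to experiment with collation outside the REST service, the same idea fits in a few lines of plain Java. This is a sketch with hypothetical names, and the exact order for a given locale depends on the collation data shipped with your JDK, so only the English result is annotated here.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative helper; the class and method names are hypothetical.
final class LocaleSort {

    // Returns a sorted copy of the names using the collation rules of the given locale tag.
    static List<String> sorted(List<String> names, String localeTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(localeTag));
        collator.setStrength(Collator.PRIMARY); // ignore case differences
        List<String> copy = new ArrayList<>(names);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = List.of("Zebra", "\u00C5ngstr\u00F6m", "Aaron");
        System.out.println(sorted(names, "en-US")); // [Aaron, Ångström, Zebra]
        System.out.println(sorted(names, "sv-SE")); // compare with the Swedish order above
    }
}
```

Sorting a copy keeps the input list untouched, which matters when the caller passes an immutable list such as List.of(...).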

The Grand Finale: The Kiss Emoji 💋 Endpoint

Let's add a final, fun endpoint that serves as a practical test of our setup. This endpoint will append a kiss emoji to a greeting's message.

In GreetingResource.java:

@POST
@Path("/{id}/kiss")
@Transactional
public Response addKiss(@PathParam("id") Long id) {
    Greeting greeting = Greeting.findById(id);
    if (greeting == null) {
        return Response.status(Response.Status.NOT_FOUND).build();
    }

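    // U+1F48B is the code point for the kiss mark emoji 💋
    String kissEmoji = new String(Character.toChars(0x1F48B));
    // Guard against greetings that were created without a message
    String current = greeting.message == null ? "" : greeting.message + " ";
    greeting.message = current + kissEmoji;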

    greeting.persist();
    return Response.ok(greeting).build();
}
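The endpoint builds the emoji from its code point instead of pasting it into the source file, which keeps the code independent of the file's encoding. A small sketch (with a hypothetical helper name) shows what that expression produces:

```java
// Illustrative helper; the class and method names are hypothetical.
final class EmojiLiteral {

    // Builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String kiss = fromCodePoint(0x1F48B);
        System.out.println(kiss.length());                        // 2 (surrogate pair)
        System.out.println(kiss.codePointCount(0, kiss.length())); // 1
        System.out.println(kiss.equals("\uD83D\uDC8B"));           // true
        // Character.toString(int), available since Java 11, gives the same result
        System.out.println(kiss.equals(Character.toString(0x1F48B))); // true
    }
}
```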

First, create a greeting to kiss:

curl -X POST http://localhost:8080/greetings \
-H "Content-Type: application/json; charset=utf-8" \
-d '{ "name": "浩宇", "message": "xoxo" }'

(Note the explicit charset=utf-8 in the header. This is a best practice!)

The response will show the newly created greeting with id: 1 (or some other number). Now, use that ID to send a virtual kiss:

curl -s -X POST "http://localhost:8080/greetings/1/kiss" | jq -r .

Result: You should get a perfect JSON response with the emoji correctly rendered.

{
  "id": 1,
  "name": "浩宇",
  "message": "xoxo 💋"
}

This confirms that your entire stack, from the client, through the Quarkus REST layer, to the database, and back, is correctly configured to handle Unicode, including multi-byte characters and emoji.

Conclusion & Best Practices Checklist

Congratulations! You've built a Unicode-aware REST service and tackled some of the most common and frustrating bugs related to text handling in Java.

Keep these in mind for every Unicode-aware service:

Always set charset=utf-8 in Content-Type headers.
Configure databases and connections explicitly for UTF-8.
Use codePointCount() and codePoints() instead of length() and toCharArray().
Normalize user input before storing or comparing.
Use Collator for locale-aware sorting.
Assume Unicode everywhere. ASCII is no longer a safe baseline.

Text is global. Your code should be too.
