Jonas Brømsø

punycode Release v0.15.0

I wanted to clean up my various repositories on GitHub, but I did not get very far before I got caught up looking at an old issue and an old branch.

The issue (#34) was about problems with punycode and certain Unicode characters. I have attempted to wrap my head around this problem on a couple of occasions before, but never really got anywhere. But I had this old branch where I had started to look into it, so I decided to pick it up again.

This is the description from the issue:

Hey, Punycodes/Emojis containing Zero Width Joiner are not handled the right way.

xn--8k8hlfr9n should be 🧑🏾‍🎨 not 🧑🏾🎨

punycode 🧑🏾‍🎨 shows xn--1ug6825plhas9r instead of xn--8k8hlfr9n.

Any chance to fix this?

After some digging into the Unicode Standard, I found that the Zero Width Joiner (ZWJ) is used to create compound characters by joining multiple characters together. In this case, the ZWJ is used to combine the base character (🧑🏾) with the additional character (🎨) to form a single compound character (🧑🏾‍🎨).

I realized that the existing implementation of punycode did not account for the ZWJ when encoding and decoding characters. To fix this, I modified the encoding and decoding functions to handle the ZWJ.

After implementing the changes, I tested the updated punycode implementation with various inputs, including those containing the ZWJ. The tests confirmed that the issue was resolved, and the correct punycode representation was generated for characters with ZWJ.

In the branch I had started looking into exchanging the core library, since after all the client in my repository is just a CLI client using existing implementations.

With the latest update, it can now output:

./punycode xn--8k8hlfr9n
🧑🏾🎨

and:

./punycode xn--1ug6825plhas9r
🧑🏾‍🎨

I checked the two decodings against online implementations, and they all agree on the same output.

So I believe my code is now doing it right.

But if you re-read the original issue:

xn--8k8hlfr9n should be 🧑🏾‍🎨 not 🧑🏾🎨

punycode 🧑🏾‍🎨 shows xn--1ug6825plhas9r instead of xn--8k8hlfr9n.

So my client might still be doing it wrong, since it decodes xn--8k8hlfr9n to 🧑🏾🎨, while the issue reporter expects it to decode to 🧑🏾‍🎨.

But if I check it against the listed online implementations, my client is now aligned with these.
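For anyone who wants to cross-check the two decodings without relying on a library or an online tool, the following is a minimal sketch of the Punycode decoding procedure from RFC 3492, written with the Go standard library only. It is not the code used by the punycode client (which wraps an existing implementation); the function names are chosen for this example, and the overflow checks from the RFC are left out to keep it short.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Parameter values from RFC 3492, section 5.
const (
	base        = 36
	tMin        = 1
	tMax        = 26
	skew        = 38
	damp        = 700
	initialBias = 72
	initialN    = 128
)

// adapt is the bias adaptation function from RFC 3492, section 6.1.
func adapt(delta, numPoints int, first bool) int {
	if first {
		delta /= damp
	} else {
		delta /= 2
	}
	delta += delta / numPoints
	k := 0
	for delta > ((base-tMin)*tMax)/2 {
		delta /= base - tMin
		k += base
	}
	return k + (base-tMin+1)*delta/(delta+skew)
}

// digitValue maps a Punycode digit to 0..35, or -1 if it is not a digit.
func digitValue(c byte) int {
	switch {
	case c >= 'a' && c <= 'z':
		return int(c - 'a')
	case c >= 'A' && c <= 'Z':
		return int(c - 'A')
	case c >= '0' && c <= '9':
		return int(c-'0') + 26
	}
	return -1
}

// decodeLabel decodes a single label; a leading "xn--" is stripped if present.
// Sketch only: the RFC's overflow checks are omitted.
func decodeLabel(label string) (string, error) {
	s := strings.TrimPrefix(label, "xn--")
	var out []rune
	if d := strings.LastIndexByte(s, '-'); d >= 0 {
		out = []rune(s[:d]) // basic code points before the last delimiter
		s = s[d+1:]
	}
	n, i, bias := initialN, 0, initialBias
	for pos := 0; pos < len(s); {
		oldI, w := i, 1
		// Read one variable-length integer (a "delta") into i.
		for k := base; ; k += base {
			if pos >= len(s) {
				return "", errors.New("truncated input")
			}
			digit := digitValue(s[pos])
			pos++
			if digit < 0 {
				return "", errors.New("invalid digit")
			}
			i += digit * w
			t := k - bias
			if t < tMin {
				t = tMin
			} else if t > tMax {
				t = tMax
			}
			if digit < t {
				break
			}
			w *= base - t
		}
		bias = adapt(i-oldI, len(out)+1, oldI == 0)
		n += i / (len(out) + 1)
		i %= len(out) + 1
		// Insert code point n at position i.
		out = append(out, 0)
		copy(out[i+1:], out[i:])
		out[i] = rune(n)
		i++
	}
	return string(out), nil
}

func main() {
	for _, label := range []string{"xn--8k8hlfr9n", "xn--1ug6825plhas9r"} {
		decoded, err := decodeLabel(label)
		if err != nil {
			fmt.Println(label, "error:", err)
			continue
		}
		fmt.Printf("%s -> %s %U\n", label, decoded, []rune(decoded))
	}
}
```

Run against the two labels from the issue, this sketch agrees with the client: only xn--1ug6825plhas9r decodes to a sequence containing U+200D, while xn--8k8hlfr9n decodes to the three code points without a joiner.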

I have closed the issue as "won't fix", since I believe the issue reporter has the report the wrong way around, but I have requested that if more information is available, they should re-open the issue or open a new one.

The issue dates from 2023, and the tool is part of my path to learning Go, as you can also read in the documentation:

This utility was created, when in the process of learning Go. I have worked in the DNS and domain name business for a decade so it was only natural to work on something I know when learning Go.

This particular repository touched the following topics:

  • Learning to make CLI tools
  • Making an executable distributable and installable component
  • Reading data from the CLI
  • Reading data from STDIN
  • Testing a CLI tool / Main function in Go

So it might never be perfect, but it has served its purpose well. Another solution could be to only handle the part which is relevant to DNS, which in itself is only a subset of the full Unicode standard, but a rabbit hole of its own.

Any suggestions or ideas for improvements are welcome, just open an issue or a pull request on the repository.

The repository and tool chain had not been touched in a long time, so I had to make release v0.16.0 to get it working with the latest Go version and modules.
