DEV Community

Paweł bbkr Pabian

Fun with UTF-8: browsing code points namespace

This trick is very useful when you need to find all code points with a given name or property. Let's find all characters that have an ogonek (the tiny tail from the previous post).

$ raku -e 'for 0..0x10FFFF {
    next unless .uniname.contains( "WITH OGONEK" );
    say .chr, " ", .uniname;
}'

Ą LATIN CAPITAL LETTER A WITH OGONEK
ą LATIN SMALL LETTER A WITH OGONEK
Ę LATIN CAPITAL LETTER E WITH OGONEK
ę LATIN SMALL LETTER E WITH OGONEK
Į LATIN CAPITAL LETTER I WITH OGONEK
į LATIN SMALL LETTER I WITH OGONEK
Ų LATIN CAPITAL LETTER U WITH OGONEK
ų LATIN SMALL LETTER U WITH OGONEK
Ǫ LATIN CAPITAL LETTER O WITH OGONEK
ǫ LATIN SMALL LETTER O WITH OGONEK
Ǭ LATIN CAPITAL LETTER O WITH OGONEK AND MACRON
ǭ LATIN SMALL LETTER O WITH OGONEK AND MACRON

In Raku you can call the uniname or chr methods directly on an integer to get the name of a code point or the character at that code point, respectively. If you are not familiar with the .method syntax: it is shorthand for calling a method on the topic variable $_, which holds the current value of the iteration, so you do not have to assign it to a named variable. If you prefer, you can be more explicit: for 0..0x10FFFF -> $codepoint { next unless $codepoint.uniname... }.
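
Spelled out in full, the explicit form might look like the sketch below. The names $codepoint, $name and @found are chosen for illustration only, and the U+XXXX column is an extra that the original one-liner does not print. The range 0..0x10FFFF spans the whole Unicode code space.

```raku
# Explicit version of the ogonek search: the loop value gets a name
# instead of living in the implicit topic variable $_.
my @found;    # matching code points, collected for later use

for 0..0x10FFFF -> $codepoint {
    my $name = $codepoint.uniname;
    next unless $name.contains( "WITH OGONEK" );

    @found.push: $codepoint;
    # Extra column for illustration: the code point in U+XXXX notation.
    say $codepoint.chr, " ", $name, " (U+", $codepoint.fmt( "%04X" ), ")";
}
```

The named variable costs a few more keystrokes, but it pays off once the block grows beyond a single method chain.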

Did you know that:

  • There are 899 digits defined in Unicode?
$ raku -e '( 0 .. 0x10FFFF ).grep( *.uniname.contains( "DIGIT" ) ).elems.say;'

899
  • There are 154 sentence terminals?
$ raku -e '( 0 .. 0x10FFFF ).grep( *.uniprop( "Sentence_Terminal" ) ).elems.say;'

154

(Unicode properties will be explained in the next post.)

  • There are 22 characters whose name ends with QUESTION MARK? (This filter goes by name only, so not every match is necessarily a sentence terminal.)
$ raku -e 'for 0 .. 0x10FFFF { next unless .uniname.ends-with( "QUESTION MARK" ); say .chr, " ", .uniname; }'

? QUESTION MARK
¿ INVERTED QUESTION MARK
; GREEK QUESTION MARK
՞ ARMENIAN QUESTION MARK
؟ ARABIC QUESTION MARK
፧ ETHIOPIC QUESTION MARK
᥅ LIMBU QUESTION MARK
...
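
The question mark search above filters by name only. A minimal sketch, assuming you want to know how many of those matches also carry the Sentence_Terminal property, is to chain the two filters used earlier. The variable names are made up for illustration, and no counts are shown because they depend on the Unicode version bundled with your Raku.

```raku
# Step 1: every code point whose name ends with "QUESTION MARK".
my @question-marks = ( 0 .. 0x10FFFF ).grep( *.uniname.ends-with( "QUESTION MARK" ) );

# Step 2: keep only those that also have the Sentence_Terminal property.
my @sentence-terminals = @question-marks.grep( *.uniprop( "Sentence_Terminal" ) );

say "question marks by name:      ", @question-marks.elems;
say "of which sentence terminals: ", @sentence-terminals.elems;

# List the ones that are named question marks but do not end a sentence.
for @question-marks.grep( { not .uniprop( "Sentence_Terminal" ) } ) {
    say .chr, " ", .uniname;
}
```

Filtering the short name-based list a second time is much cheaper than scanning the whole code space twice.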

So, what was the funniest thing you found in Unicode?¿⸮
