DEV Community

mixbo

How to calculate emoji length?

In JavaScript, calling length on a String object usually returns the length you expect.

But getting the length of an emoji in JavaScript is more troublesome. Let me show you what I found.

"ihavecoke".length // 9

"😁".length // 2

As you can see, calling length on 'ihavecoke' returns 9, which is correct and makes sense.

On line 3, though, you only get a length of 2. What? A single emoji character has a string length of 2?

"👨‍👨‍👧‍👦".length is even stranger: it returns 11. Why? Doesn't an emoji always return 2?
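A minimal sketch of where those numbers come from: length counts UTF-16 code units, while spreading a string iterates by code point. The helper names `codeUnits` and `codePoints` are hypothetical, and the emoji are written as escapes so the invisible joiners are explicit.

```javascript
// String.prototype.length counts UTF-16 code units, not visible characters.
const grin = "\u{1F601}"; // 😁
// man + ZWJ + man + ZWJ + girl + ZWJ + boy (the family emoji)
const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helpers, named for illustration only.
const codeUnits = (s) => s.length;       // 16-bit chunks
const codePoints = (s) => [...s].length; // string iteration walks code points

console.log(codeUnits(grin), codePoints(grin));     // 2 1
console.log(codeUnits(family), codePoints(family)); // 11 7
```

Counting code points fixes the single emoji, but the family emoji still comes out as 7 because it is built from several code points.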

So how do you make an emoji count as length 1? You can use the lodash method toArray. It's simple and useful.

_.toArray("😁").length // 1

_.toArray("👨‍👨‍👧‍👦").length // 1

_.toArray("ihavecoke 🤩").length // 11

So now each emoji has a length of 1, lol.
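If you would rather not pull in lodash, a possible alternative (not what the post itself uses) is the built-in Intl.Segmenter, assuming a runtime that ships it, such as a recent Node.js or browser. The helper name `graphemeLength` is hypothetical.

```javascript
// Count user-perceived characters (grapheme clusters) with the built-in Intl.Segmenter.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: number of grapheme clusters in a string.
function graphemeLength(text) {
  return [...segmenter.segment(text)].length;
}

const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F466}"; // family emoji

console.log(graphemeLength("\u{1F601}"));           // 1
console.log(graphemeLength(family));                // 1
console.log(graphemeLength("ihavecoke \u{1F929}")); // 11
```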

Hope it can help you :)

Latest comments (6)

Clay Murray

As to why this happens: the web is typically in UTF-8 (the vast majority of websites use UTF-8), which is a way to encode characters. Basically, encoding is assigning a number to a letter. Like you could encode "A" as 1, "B" as 2, etc. Back in the day there was a thing created called ASCII (en.wikipedia.org/wiki/ASCII), but it only used 7 bits to encode the characters, which means a max of 128 total characters. Well, that's fine for English, but what if you need more? So there were a bunch of different ways text got encoded. Like lots and lots, and some of them incompatible with ASCII. Eventually the web settled on a way to represent letters as numbers called UTF-8. UTF-8 is interesting because it uses a variable number of bits to represent a character: 8, 16, 24, or 32. Plain ASCII characters still take a single 8-bit byte, which makes it compatible with the old ASCII, but it also allows for a huge number of different characters and languages. The scheme looks at the first 8 bits, and if they fall in a certain range it looks at the next 8 bits, and so on, until it can make a character.
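To see the variable width in practice, here is a small sketch using the standard TextEncoder, which always encodes to UTF-8. The helper name `utf8Bytes` is hypothetical.

```javascript
// TextEncoder always produces UTF-8, so the byte count shows the variable width.
const encoder = new TextEncoder();

// Hypothetical helper: number of UTF-8 bytes needed for a string.
const utf8Bytes = (s) => encoder.encode(s).length;

console.log(utf8Bytes("A"));         // 1 (plain ASCII)
console.log(utf8Bytes("\u00E9"));    // 2 (é)
console.log(utf8Bytes("\u20AC"));    // 3 (€)
console.log(utf8Bytes("\u{1F601}")); // 4 (😁)
```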
Well, to put a wrinkle in it: although web pages and code are typically in UTF-8, internally JavaScript stores strings as UTF-16. UTF-16 is like UTF-8, but instead of using a minimum size of 8 bits it uses 16 or 32 bits to represent a character. So when you ask JavaScript how long a string is, it breaks it up into 16-bit chunks and tells you how many 16-bit chunks there are. BUT some characters (including most emoji) are encoded as two 16-bit chunks, so JavaScript will tell you that the length is 2.
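The two 16-bit chunks can be inspected directly. A short sketch, with `units` as an illustrative variable name:

```javascript
// One emoji, stored as two UTF-16 code units (a surrogate pair).
const grin = "\u{1F601}"; // 😁

// Read each 16-bit chunk individually with charCodeAt.
const units = Array.from({ length: grin.length }, (_, i) =>
  grin.charCodeAt(i).toString(16)
);

console.log(units);                            // [ 'd83d', 'de01' ]
console.log(grin.codePointAt(0).toString(16)); // 1f601, the single code point behind both chunks
```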

So that's part 1. Part 2 is emoji. Emoji are interesting. What you see on screen is not necessarily the full truth. Emoji have a way to be joined together. For instance, the pride flag 🏳️‍🌈 is ACTUALLY a white flag 🏳 and a rainbow 🌈 mashed together with an invisible character (the zero-width joiner) that says "hey, mash these two together". So on systems that don't know about the pride flag you just get 🏳 🌈. Well, what does that tell us about length? Well, 🏳 is 2 and 🌈 is 2 and 🏳️‍🌈 is 6: the extra 2 come from the invisible "mash these two together" character and an equally invisible variation selector after the flag, each of length 1. So what is it about 👨‍👨‍👧‍👦 that returns 11? Well, it's a super mashup emoji: it's 👨 and 👨 and 👧 and 👦 all put together with the "mash these two together" character. That actually makes it possible to have a huge variety of family emoji, because we are combining them. So why 11? Each of the four people is length 2, which gives 8, and each of the 3 mash-together characters is length 1, which gives 8 + 3 = 11. (That's 176 bits in UTF-16 for just that emoji! Compared to just 16 for the letter A.)
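The pieces of a joined emoji can be listed by iterating over its code points. A minimal sketch, where `listCodePoints` is a hypothetical helper and the sequences are spelled out as escapes:

```javascript
// Hypothetical helper: list the code points of a string in U+XXXX notation.
const listCodePoints = (s) =>
  [...s].map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

// white flag + variation selector + zero-width joiner + rainbow
const prideFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}";
// man + ZWJ + man + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F466}";

console.log(listCodePoints(prideFlag)); // [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]
console.log(prideFlag.length);          // 6 = 2 + 1 + 1 + 2
console.log(listCodePoints(family).length); // 7 code points
console.log(family.length);                 // 11 = 4 * 2 + 3 * 1
```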

Jamal Mashal • Edited
"😁".split().length
 
Peter Blockman

Does not work

"🥵👩‍👩‍👦‍👦".split().length // 1
"😁😁😁".split().length // 1
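A short sketch of why split() always reports 1 here: with no separator it returns a one-element array holding the whole string, and passing an empty string instead splits between UTF-16 code units, which breaks emoji apart.

```javascript
const grins = "\u{1F601}\u{1F601}\u{1F601}"; // three 😁

// No separator: the whole string comes back as a single array element.
console.log(grins.split().length);     // 1

// Empty separator: splits between UTF-16 code units, tearing surrogate pairs in half.
console.log(grins.split("").length);   // 6

// Array.from iterates by code point, which is enough for simple emoji like this one.
console.log(Array.from(grins).length); // 3
```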
 
mixbo

👍

Pacharapol Withayasakpunt • Edited

I would use runes2, but it doesn't hide the reality that anything beyond UTF-8 is complex.

If you just want to match Unicode symbols, a better and less cryptic idea is to use XRegExp. For other symbols, it is XRegExp('\\p{So}').
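As a side note, engines that support ES2018 Unicode property escapes can match the same category with a native regular expression and the u flag. This is a sketch of that native alternative, not of XRegExp itself, and `countOtherSymbols` is a hypothetical helper name.

```javascript
// \p{So} matches the "Symbol, other" general category, which includes many emoji.
const otherSymbol = /\p{So}/gu;

// Hypothetical helper: count code points in the So category.
const countOtherSymbols = (s) => (s.match(otherSymbol) || []).length;

const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F466}"; // family emoji

console.log(countOtherSymbols("ihavecoke \u{1F929}")); // 1
console.log(countOtherSymbols("abc"));                 // 0
console.log(countOtherSymbols(family));                // 4 (the joiners are not symbols)
```

Note that this counts matching code points, so a joined emoji such as the family still counts as several symbols.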

Shriji

HAHAHAH, never thought about this.

Indeed a great question. While digging, I found an answer on SO: stackoverflow.com/a/46085147