Oluwanifemi Latunde

UTF-8 Validation

It's day 7 of the #I4G10DaysOfCodeChallenge. Today's task was to determine whether a given array of integers, each representing one byte, is a valid UTF-8 encoding.

You can find more details about the challenge here

A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:

  1. For a 1-byte character, the first bit is 0, followed by its Unicode code.
  2. For an n-byte character, the first n bits are all ones and the (n+1)th bit is 0, followed by n-1 bytes whose two most significant bits are 10.
     Number of Bytes   |        UTF-8 Octet Sequence
                       |              (binary)
   --------------------+-----------------------------------------
            1          |   0xxxxxxx
            2          |   110xxxxx 10xxxxxx
            3          |   1110xxxx 10xxxxxx 10xxxxxx
            4          |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
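
To see the table in practice, here is a small sketch that prints the bit pattern of each byte Python produces when it encodes a character. The helper name `byte_patterns` is only for illustration and is not part of the challenge.

```python
def byte_patterns(ch: str) -> list:
    """Return each UTF-8 byte of ch as an 8-bit binary string."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


# One sample character for each length in the table above.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji
for ch in samples:
    print(len(byte_patterns(ch)), byte_patterns(ch))
# 1 ['01000001']
# 2 ['11000011', '10101001']
# 3 ['11100010', '10000010', '10101100']
# 4 ['11110000', '10011111', '10011000', '10000000']
```

The first byte of each sequence starts with as many ones as the sequence has bytes (or a single 0 for a 1-byte character), and every following byte starts with 10.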

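Algorithm:

  1. Start with count = 0. It holds the number of continuation bytes still expected.
  2. For each integer x in the data array, do the following.

  3. If count is 0, x must be the first byte of a character, so check steps 4 to 7 in order:

  4. If x/32 = 110 (binary), then set count to 1. (Integer division x/32 is the same as x >> 5.)

  5. Else if x/16 = 1110, then set count to 2. (x/16 is the same as x >> 4.)

  6. Else if x/8 = 11110, then set count to 3. (x/8 is the same as x >> 3.)

  7. Else if x/128 is not 0, then return false, because the byte is not a valid first byte. (x/128 is the same as x >> 7.)

  8. Otherwise (count is not 0), x must be a continuation byte: if x/64 is not 10, then return false; else decrease count by 1. (x/64 is the same as x >> 6.)

  9. After the loop, return true if count is 0, otherwise return false.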
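
The steps above can be sketched in Python roughly as follows. This is a minimal sketch rather than the exact submitted solution: the function name `valid_utf8` is hypothetical, and masking each integer to its low 8 bits is an assumption about the input format. Like the steps above, it checks only the bit patterns and does not reject overlong or out-of-range encodings.

```python
def valid_utf8(data):
    """Return True if data (a list of ints, one per byte) follows the UTF-8 bit patterns."""
    count = 0  # continuation bytes still expected for the current character
    for x in data:
        x &= 0xFF  # assumption: only the lowest 8 bits of each integer matter
        if count == 0:
            # x must be the first byte of a character.
            if x >> 5 == 0b110:
                count = 1
            elif x >> 4 == 0b1110:
                count = 2
            elif x >> 3 == 0b11110:
                count = 3
            elif x >> 7 != 0:
                # Top bit set but no valid first-byte pattern matched.
                return False
        else:
            # x must be a continuation byte of the form 10xxxxxx.
            if x >> 6 != 0b10:
                return False
            count -= 1
    # A character that was cut short leaves count above 0.
    return count == 0


print(valid_utf8([197, 130, 1]))  # True: a 2-byte character, then a 1-byte character
print(valid_utf8([235, 140, 4]))  # False: the third byte is not a continuation byte
```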

Result:
Runtime: 234 ms, faster than 56.39% of Python3 online submissions for UTF-8 Validation.

Memory Usage: 14.1 MB, less than 97.22% of Python3 online submissions for UTF-8 Validation.