It's day 7 of the #I4G10DaysOfCodeChallenge. The objective of today's task was to determine whether a given array of integers represents a valid UTF-8 encoding.
You can find more details about the challenge here
A character in UTF-8 can be from 1 to 4 bytes long, subject to the following rules:
- For a 1-byte character, the first bit is a 0, followed by its Unicode code.
- For an n-byte character, the first n bits are all ones and the (n+1)th bit is 0, followed by n-1 bytes whose most significant 2 bits are 10.
Number of Bytes     | UTF-8 Octet Sequence (binary)
--------------------+-----------------------------------------
1                   | 0xxxxxxx
2                   | 110xxxxx 10xxxxxx
3                   | 1110xxxx 10xxxxxx 10xxxxxx
4                   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
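To make the table concrete, here is a minimal sketch that reads the leading bits of a single byte and reports the sequence length it announces. The function name `leading_byte_length` is hypothetical (it is not part of the challenge), and the sketch assumes the input is an integer between 0 and 255.

```python
def leading_byte_length(byte: int) -> int:
    """Return the sequence length a lead byte announces (1 to 4),
    or 0 if the byte is a continuation byte or not a valid lead byte.

    Assumes 0 <= byte <= 255.
    """
    if byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if byte >> 3 == 0b11110:    # 11110xxx
        return 4
    return 0                    # 10xxxxxx or 11111xxx


print(format(197, "08b"), leading_byte_length(197))  # 11000101 2
print(format(130, "08b"), leading_byte_length(130))  # 10000010 0
```

Shifting right by k bits discards the k lowest bits, so only the prefix from the table is left to compare.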
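Algorithm:
- Start with count = 0, the number of continuation bytes still expected.
- For each integer x in the data array:
  - If count is 0, x must start a new character:
    - If x >> 5 is 110 (binary), set count to 1. (x >> 5 is the same as the integer division x / 32.)
    - Else if x >> 4 is 1110 (binary), set count to 2. (x >> 4 is the same as x / 16.)
    - Else if x >> 3 is 11110 (binary), set count to 3. (x >> 3 is the same as x / 8.)
    - Else if x >> 7 is not 0, return false, because x is not a valid leading byte. (x >> 7 is the same as x / 128.)
  - Otherwise, x must be a continuation byte: if x >> 6 is not 10 (binary), return false; else decrease count by 1. (x >> 6 is the same as x / 64.)
- After the loop, return true if count is 0; otherwise return false, because the last character is incomplete.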
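The steps above can be sketched in Python roughly as follows. This is one possible implementation, not necessarily the exact code that was submitted; the function name `valid_utf8` is an assumption, and the sketch assumes every integer in `data` is between 0 and 255.

```python
from typing import List


def valid_utf8(data: List[int]) -> bool:
    # Number of continuation bytes still expected for the current character.
    count = 0
    for x in data:
        if count == 0:
            # x must be a leading byte: inspect its high bits.
            if x >> 5 == 0b110:       # 110xxxxx: 2-byte character
                count = 1
            elif x >> 4 == 0b1110:    # 1110xxxx: 3-byte character
                count = 2
            elif x >> 3 == 0b11110:   # 11110xxx: 4-byte character
                count = 3
            elif x >> 7 != 0:         # 10xxxxxx or 11111xxx: invalid lead
                return False
            # Otherwise x is 0xxxxxxx, a complete 1-byte character.
        else:
            # x must be a continuation byte of the form 10xxxxxx.
            if x >> 6 != 0b10:
                return False
            count -= 1
    # Valid only if no character was left unfinished.
    return count == 0


print(valid_utf8([197, 130, 1]))  # True
print(valid_utf8([235, 140, 4]))  # False
```

In the first call, 197 (11000101) announces a 2-byte character, 130 (10000010) completes it, and 1 is a 1-byte character. In the second, 235 (11101011) announces a 3-byte character, but 4 (00000100) does not start with 10.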
Result:
Runtime: 234 ms, faster than 56.39% of Python3 online submissions for UTF-8 Validation.
Memory Usage: 14.1 MB, less than 97.22% of Python3 online submissions for UTF-8 Validation.