DEV Community

Tom R.
Tom R.

Posted on

Surprisingly Simple Range Sets with Bisection in Python

Recently I came across the problem of finding Unicode identifiers in a text. Don't worry, this article is not about lexical analysis. All you need to know is that the problem boils down to
checking each character for the derived Unicode properties XID_Start(characters which are allowed as first character) and XID_Continue (allowed follow-up characters).

Whether you want to understand all the details mentioned in the Unicode® Standard Annex #31 or you simply get yourself a copy of this handy text file with a list of all character ranges sharing the desired property, at the end of the day you need to check for a given code point whether it falls within one of those ranges or whether it doesn't. And you need to check this for many characters, so your underlying approach should be reasonably fast.

Having said that, if speed is the only thing that counts for you, this might not be the article you want to dive into. Other people have solved this problem before (see for exmple dtolnay/unicode-ident).

What I am going to show to you is an elegant yet rather efficient way of storing and accessing sets of ranges. So lets dive into it.

A Red and Black Tape Measure

What we have to deal with are of
the order of thousand intervals of
the form 0041..005A, 0061..007A, 00AA..00AA, 00B5..00B5, 00BA..00BA,
00C0..00D6, ...

Now imagine you take a tape measure or a ruler or the like and color all tape sections belonging to one of those intervals in black and all sections not belonging to any of the intervals in red.

A sketch of a tape measure with alternating sections colored in red and black

All we need to remember to color a new tape measure with exactly the same coloring is an ordered list of points where a new section starts and the information whether at the first point the color changes from red to black or vice versa. For now we only deal with finite ranges so we stick to the convention that the first point in the list is the starting point of a range and hence is marked as a change from red to black.

For our numbers from the Unicode ranges we need to store the following list:

bounds = [
   0x0041, 0x005B,
   0x0061, 0x007B,
   0x00AA, 0x00AB,
   0x00B5, 0x00B5,
   0x00BA, 0x00BA,
   0x00C0, 0x00D6,
   # ...
]
Enter fullscreen mode Exit fullscreen mode

Note that we have written down the lower (included) bound of each range and the upper (excluded!) bound. What we have done here for integer numbers can be extended to any data type forming a totally ordered set:

  • rational or real numbers,
  • strings (with lexicographic ordering),
  • times and dates,
  • tuples and sequences thereof (with lexicographic ordering).
  • basically any class implementing the __lt__ method.

We haven't used any property of the integer numbers beyond that. The fact that we have chosen the upper included bound plus one as the start of the next red section merely reflects the discrete nature of the integer numbers.

How Bisection comes in

Now, let's have a look at how to check whether a character falls in one of the ranges. Say, we encounter a semicolon character ; (ASCII/Unicode code point 003B) and therefore need to check whether the code point 003B is situated in a red section (i.e. outside any of the ranges) or in a black section.

Obviously, scanning the list linearly from left to right cannot be an optimal strategy. Instead, we have to use something that exploits the fact that we operate on sorted data.

The algorithm of choice is bisection (binary search), which comes with the Python standard library in the bisect module.

If we use bisect.bisect_right we get an index such that all elements to the left of the index are below or equal to the argument of bisect_right, and all elements at or right to the index are strictly larger than the argument. Thus the number at the returned index is the smallest lower bound of a red or black section which is larger than the searched number. If the index is even the section beginning at that bound is black, so the section where the searched number falls into is red. This implies that the searched numbee is outside any of the ranges.

If on the other hand the index is odd the number must be part of a black section and hence is inside one of rhe ranges.

def contains_int(bounds, e):
   return bisect_right(bounds, e) % 2 == 1
Enter fullscreen mode Exit fullscreen mode

That's it! Solved! Done! Over! This could have been the end of this blog post.

Mutable Sets

Admittedly, part of the reason why everything is so simple is the fact that our data already was present as a list of disjunct, non-overlapping ranges. Imagine now, you are the poor soul that has to write the code to aggregate the ranges in this nice format by repeatedly adding and removing different and potentially overlapping parts of the Unicode range or, using the image of the tape measure, changing the colors of different parts of sections.

Adding a range to the set means you have to color a corresponding strip on the taoe measure in black. So boundary points within the new colored section must be removed from our list of bounds. Furthermore, both the start and the end point of the new section may be part of an existing red or black section.

If the existing section has the same color as the new section then the start (resp. end) point of the new section must not be added to the list. If the colors are different, of course, we have to insert the new point to the list.

Again, we can solve the problem using bisection what is new, though, is that we use bisect_left for the start point and bisect_right at the end point of the range to be added.

def add_range(bounds, lower, upper):
   new_strip = []

   il = bisect_left(bounds, lower)
   if il % 2 == 0:
      new_strip.append(lower)

   iu = bisect_right(bounds, upper)
   if iu % 2 == 0:
      span.append(upper)

   bounds[il:iu] = new_strip
Enter fullscreen mode Exit fullscreen mode

Removing a section is as easy as adding one because in our representation we just add a red section instead of adding a black section. So all we have to do is swapping "even" and "odd".

def del_range(bounds, lower, upper):
   new_strip = []

   il = bisect_left(bounds, lower)
   if il % 2 == 1:
      new_strip.append(lower)

   iu = bisect_right(bounds, upper)
   if iu % 2 == 1:
      span.append(upper)

   bounds[il:iu] = new_strip
Enter fullscreen mode Exit fullscreen mode

Infinite Ranges

As mentioned earlier, up to know we assumed that the first entry in the list of bounds corresponded to a lower bound of a range, i.e. a color change from red to black. If we maintain the information of the first color change explicitly, we can represent negation, ranges starting at minus infinity or ranges running up to infinity with minimal extra effort.

Final Thought

I haven't done an extensive search, so I would not be surprised if anyone else has published the same data structure and algorithm before. If you know about such a cross-reference let me know, nd I'll update the links.

Top comments (0)