<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom R.</title>
    <description>The latest articles on DEV Community by Tom R. (@tom65536).</description>
    <link>https://dev.to/tom65536</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3580701%2F466c9f5f-ea91-4274-ba07-4ebe4c8948b4.png</url>
      <title>DEV Community: Tom R.</title>
      <link>https://dev.to/tom65536</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tom65536"/>
    <language>en</language>
    <item>
      <title>Surprisingly Simple Range Sets with Bisection in Python</title>
      <dc:creator>Tom R.</dc:creator>
      <pubDate>Mon, 27 Oct 2025 21:31:10 +0000</pubDate>
      <link>https://dev.to/tom65536/surprisingly-simple-range-sets-with-bisection-in-python-20hp</link>
      <guid>https://dev.to/tom65536/surprisingly-simple-range-sets-with-bisection-in-python-20hp</guid>
      <description>&lt;p&gt;Recently I came across the problem of finding Unicode identifiers in a text. Don't worry, this article is not about lexical analysis. All you need to know is that the problem boils down to&lt;br&gt;
checking each character for the derived Unicode properties &lt;code&gt;XID_Start&lt;/code&gt;(characters which are allowed as first character) and &lt;code&gt;XID_Continue&lt;/code&gt; (allowed follow-up characters).&lt;/p&gt;

&lt;p&gt;Whether you want to understand all the details mentioned in the &lt;a href="https://www.unicode.org/reports/tr31/" rel="noopener noreferrer"&gt;Unicode® Standard Annex #31&lt;/a&gt; or you simply get yourself a copy of &lt;a href="https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt" rel="noopener noreferrer"&gt;this handy text file&lt;/a&gt; with  a list of all character ranges sharing the desired property, at the end of the day you need to check for a given code point whether it falls within one of those ranges or whether it doesn't. And you need to check this for many characters, so your underlying approach should be reasonably fast.&lt;/p&gt;

&lt;p&gt;Having said that, if speed is the only thing that counts for you, this might not be the article you want to dive into. Other people have solved this problem before (see for exmple &lt;a href="https://github.com/dtolnay/unicode-ident" rel="noopener noreferrer"&gt;dtolnay/unicode-ident&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;What I am going to show to you is an elegant yet rather efficient way of storing and accessing sets of ranges. So lets dive into it.&lt;/p&gt;
&lt;h2&gt;
  
  
  A Red and Black Tape Measure
&lt;/h2&gt;

&lt;p&gt;What we have to deal with are of&lt;br&gt;
the order of thousand intervals of&lt;br&gt;
the form 0041..005A, 0061..007A, 00AA..00AA, 00B5..00B5, 00BA..00BA,&lt;br&gt;
00C0..00D6, ...&lt;/p&gt;

&lt;p&gt;Now imagine you take a tape measure or a ruler or the like and color all tape sections belonging to one of those intervals in black and all sections not belonging to any of the intervals in red.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh34n7fx7yx0ktinqnyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh34n7fx7yx0ktinqnyb.png" alt="A sketch of a tape measure with alternating sections colored in red and black" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All we need to remember to color a new tape measure with exactly the same coloring is an ordered list of points where a new section starts and the information whether at the first point the color changes from red to black or vice versa. For now we only deal with finite ranges so we stick to the convention that the first point in the list is the starting point of a range and hence is marked as a change from red to black.&lt;/p&gt;

&lt;p&gt;For our numbers from the Unicode ranges we need to store the following list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;bounds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
   &lt;span class="mh"&gt;0x0041&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x005B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mh"&gt;0x0061&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x007B&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mh"&gt;0x00AA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00AB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mh"&gt;0x00B5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00B5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mh"&gt;0x00BA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00BA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="mh"&gt;0x00C0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x00D6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="c1"&gt;# ...
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we have written down the lower (included) bound of each range and the upper (excluded!) bound. What we have done here for integer numbers can be extended to any data type forming a totally ordered set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rational or real numbers,&lt;/li&gt;
&lt;li&gt;strings (with lexicographic ordering),&lt;/li&gt;
&lt;li&gt;times and dates,&lt;/li&gt;
&lt;li&gt;tuples and sequences thereof (with lexicographic ordering).&lt;/li&gt;
&lt;li&gt;basically any class implementing the &lt;code&gt;__lt__&lt;/code&gt; method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We haven't used any property of the integer numbers beyond that. The fact that we have chosen the upper included bound &lt;em&gt;plus one&lt;/em&gt; as the start of the next red section merely reflects the discrete nature of the integer numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Bisection comes in
&lt;/h2&gt;

&lt;p&gt;Now, let's have a look at how to check whether a character falls in one of the ranges. Say, we encounter a semicolon character &lt;code&gt;;&lt;/code&gt; (ASCII/Unicode code point 003B) and therefore need to check whether the code point 003B is situated in a red section (i.e. outside any of the ranges) or in a black section.&lt;/p&gt;

&lt;p&gt;Obviously, scanning the list linearly from left to right cannot be an optimal strategy. Instead, we have to use something that exploits the fact that we operate on sorted data.&lt;/p&gt;

&lt;p&gt;The algorithm of choice is &lt;a href="https://en.wikipedia.org/wiki/Binary_search" rel="noopener noreferrer"&gt;bisection (binary search)&lt;/a&gt;, which comes with the Python standard library in the &lt;a href="https://docs.python.org/3/library/bisect.html#module-bisect" rel="noopener noreferrer"&gt;&lt;code&gt;bisect&lt;/code&gt;&lt;/a&gt; module.&lt;/p&gt;

&lt;p&gt;If we use &lt;a href="https://docs.python.org/3/library/bisect.html#bisect.bisect_right" rel="noopener noreferrer"&gt;&lt;code&gt;bisect.bisect_right&lt;/code&gt;&lt;/a&gt; we get an index such that all elements to the left of the index are below or equal to the argument of &lt;code&gt;bisect_right&lt;/code&gt;, and all elements at or right to the index are strictly larger than the argument. Thus the number at the returned index is the smallest lower bound of a red or black section which is larger than the searched number. If the index is even the section beginning at that bound is black, so the section where the searched number falls into is red. This implies that the searched numbee is outside any of the ranges.&lt;/p&gt;

&lt;p&gt;If on the other hand the index is odd the number must be part of a black section and hence is inside one of rhe ranges.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;contains_int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;bisect_right&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it! Solved! Done! Over! This could have been the end of this blog post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mutable Sets
&lt;/h2&gt;

&lt;p&gt;Admittedly, part of the reason why everything is so simple is the fact that our data already was present as a list of disjunct, non-overlapping ranges. Imagine now, you are the poor soul that has to write the code to aggregate the ranges in this nice format by repeatedly adding and removing different and potentially overlapping parts of the Unicode range or, using the image of the tape measure, changing the colors of different parts of sections.&lt;/p&gt;

&lt;p&gt;Adding a range to the set means you have to color a corresponding strip on the taoe measure in black. So boundary points within the new colored section must be removed from our list of bounds. Furthermore, both the start and the end point of the new section may be part of an existing red or black section.&lt;/p&gt;

&lt;p&gt;If the existing section has the same color as the new section then the start (resp. end) point of the new section must not be added to the list. If the colors are different, of course, we have to insert the new point to the list.&lt;/p&gt;

&lt;p&gt;Again, we can solve the problem using bisection what is new, though, is that we use &lt;code&gt;bisect_left&lt;/code&gt; for the start point and &lt;code&gt;bisect_right&lt;/code&gt; at the end point of the range to be added.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;new_strip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

   &lt;span class="n"&gt;il&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;il&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;new_strip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;iu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bisect_right&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;iu&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;il&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;iu&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_strip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Removing a section is as easy as adding one because in our representation we just add a red section instead of adding a black section. So all we have to do is swapping "even" and "odd".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;del_range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;new_strip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

   &lt;span class="n"&gt;il&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bisect_left&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;il&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;new_strip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;iu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;bisect_right&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;iu&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;il&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;iu&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_strip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Infinite Ranges
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, up to know we assumed that the first entry in the list of bounds corresponded to a lower bound of a range, i.e. a color change from red to black. If we maintain the information of the first color change explicitly, we can represent negation, ranges starting at minus infinity or ranges running up to infinity with minimal extra effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;I haven't done an extensive search, so I would not be surprised if anyone else has published the same data structure and algorithm before. If you know about such a cross-reference let me know, nd I'll update the links.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datastructures</category>
      <category>algorithms</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
