<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuvraj Raghuvanshi</title>
    <description>The latest articles on DEV Community by Yuvraj Raghuvanshi (@yuvrajraghuvanshis).</description>
    <link>https://dev.to/yuvrajraghuvanshis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3905869%2F3bf3b41f-3acf-43f7-8e09-103390db5ac8.png</url>
      <title>DEV Community: Yuvraj Raghuvanshi</title>
      <link>https://dev.to/yuvrajraghuvanshis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yuvrajraghuvanshis"/>
    <language>en</language>
    <item>
      <title>Counting a Billion Things With 1.5 Kilobytes</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Fri, 15 May 2026 09:35:52 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/counting-a-billion-things-with-15-kilobytes-33ga</link>
      <guid>https://dev.to/yuvrajraghuvanshis/counting-a-billion-things-with-15-kilobytes-33ga</guid>
      <description>&lt;p&gt;Here’s a problem that sounds trivial until you think about it for a moment.&lt;/p&gt;

&lt;p&gt;You have a stream of data: user IDs hitting your API, search queries, IP addresses, product views. The stream is enormous, hundreds of millions of events per day. Someone asks: how many &lt;em&gt;unique&lt;/em&gt; users did we serve today?&lt;/p&gt;

&lt;p&gt;The naive answer is a set. Throw every user ID into a set and count its size at the end of the day. This is correct. It is also catastrophically expensive. A set of 100 million 64-bit integers takes roughly 800 megabytes for the raw values alone, before any container overhead. Scale to a billion users and you’re at 8 gigabytes, just for one counter, just for one day. A probabilistic counter needs almost none of that. The original HyperLogLog paper quotes about 1.5 kilobytes for cardinalities beyond a billion at roughly 2% error, and Redis’s PFCOUNT works on a structure capped at 12 kilobytes per key.&lt;/p&gt;
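&lt;p&gt;The memory figures for the naive set are simple arithmetic, sketched below. The helper name raw_set_bytes is made up for this illustration, and it counts only the 8-byte payload per ID, so a real set would need more.&lt;/p&gt;

```python
# Raw payload only: 8 bytes per 64-bit ID, ignoring any container overhead.
BYTES_PER_ID = 8


def raw_set_bytes(distinct_ids):
    """Lower bound on the memory needed to store every distinct 64-bit ID."""
    return distinct_ids * BYTES_PER_ID


for n in (100_000_000, 1_000_000_000):
    # 1 MB is taken as 10^6 bytes here
    print(f"{n:,} IDs: {raw_set_bytes(n) / 1e6:,.0f} MB")
```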

&lt;p&gt;That number is not a typo. The algorithm behind it is called HyperLogLog. It was introduced in 2007 by the French mathematician Philippe Flajolet and his co-authors, and it is one of the most satisfying things I’ve encountered in a while. Let’s build it from scratch.&lt;/p&gt;

&lt;h3&gt;The Problem Has a Name&lt;/h3&gt;

&lt;p&gt;This is called the &lt;strong&gt;count-distinct problem&lt;/strong&gt;, or cardinality estimation. You have a multiset (a collection where elements can repeat) and you want to know how many distinct elements it contains, without storing all of them.&lt;/p&gt;

&lt;p&gt;The exact answer always requires memory proportional to the number of distinct elements. There’s no way around that. But if you’re willing to accept a small, predictable error (say, within 2% of the true count) you can do something remarkable: use memory proportional to the &lt;em&gt;logarithm of the logarithm&lt;/em&gt; of the count. That’s where the “LogLog” in HyperLogLog comes from. It’s not a marketing name. It’s a description of the memory complexity.&lt;/p&gt;
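&lt;p&gt;To get a feel for the “LogLog” part, here is a rough sketch of how many bits one counter needs if all it stores is a leading-zero count. The function name register_bits is hypothetical, and the sketch assumes the longest run is about log2 of the cardinality.&lt;/p&gt;

```python
import math


def register_bits(max_cardinality):
    """Bits needed for one counter that stores a leading-zero count."""
    # The longest run of zeros we expect is about log2(cardinality)...
    longest_run = math.ceil(math.log2(max_cardinality))
    # ...and storing a number that small takes about log2(log2(cardinality)) bits.
    return math.ceil(math.log2(longest_run + 1))


for n in (10 ** 6, 10 ** 9, 10 ** 12):
    print(f"up to {n:,} distinct items: {register_bits(n)} bits per counter")
```

&lt;p&gt;Going from a million to a trillion items adds a single bit per counter, which is the whole point.&lt;/p&gt;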

&lt;p&gt;To understand how this is possible, we need to start with a coin flip.&lt;/p&gt;

&lt;h3&gt;The Coin Flip Observation&lt;/h3&gt;

&lt;p&gt;Imagine flipping a fair coin repeatedly until you get heads, and recording how many tails you saw before the first heads. Call that number k.&lt;/p&gt;

&lt;p&gt;If k = 0, you got heads immediately. Probability: 1/2. If k = 1, you got one tail then heads. Probability: 1/4. If k = 2, two tails then heads. Probability: 1/8.&lt;/p&gt;

&lt;p&gt;The probability of seeing k leading tails before the first heads is 1 / 2^(k+1). Which means: if the longest run of leading tails you've &lt;em&gt;ever&lt;/em&gt; seen across many experiments is k, you've probably run roughly 2^k experiments.&lt;/p&gt;
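&lt;p&gt;The claim is easy to poke at with a small simulation. This is only a sketch: the function names are invented here, the seed is arbitrary, and a single run is noisy, so expect 2^k to land within a small factor of the experiment count rather than on it.&lt;/p&gt;

```python
import random


def tails_before_heads(rng):
    """Flip a fair coin until heads; return how many tails came first."""
    tails = 0
    while rng.getrandbits(1) == 0:  # treat a 0 bit as tails, a 1 bit as heads
        tails += 1
    return tails


def longest_run(experiments, rng):
    """Longest run of leading tails seen across many independent experiments."""
    return max(tails_before_heads(rng) for _ in range(experiments))


rng = random.Random(42)  # fixed seed so reruns are repeatable
for n in (1_000, 100_000):
    k = longest_run(n, rng)
    print(f"{n} experiments: longest run k = {k}, 2^k = {2 ** k}")
```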

&lt;p&gt;See where this is going?&lt;/p&gt;

&lt;p&gt;If we hash every element in our stream to a sequence of bits, those bits behave like coin flips: roughly half of the hashes start with 0, a quarter start with 00, an eighth start with 000. If the longest run of leading zeros we’ve seen in any hash is k, our estimate for how many distinct elements we've processed is 2^k.&lt;/p&gt;

&lt;p&gt;Let’s check the intuition with real numbers. Say we have 1,000 distinct elements. The probability that at least one of them hashes to a value with 10 or more leading zeros is about 62%, because 2^10 = 1024 ≈ 1000. The probability that any of them hashes to 20 or more leading zeros is only about 0.1%, because 2^20 = 1,048,576 is much larger than our actual count.&lt;/p&gt;
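&lt;p&gt;These probabilities can be computed directly. The sketch below assumes hashes are uniform and independent, and the function name p_at_least_one is made up for the example.&lt;/p&gt;

```python
def p_at_least_one(n, k):
    """P(at least one of n uniform hashes has k or more leading zero bits)."""
    p_single = 2.0 ** -k              # chance for one hash
    return 1.0 - (1.0 - p_single) ** n


print(f"10 leading zeros among 1,000 hashes: {p_at_least_one(1000, 10):.3f}")  # about 0.62
print(f"20 leading zeros among 1,000 hashes: {p_at_least_one(1000, 20):.5f}")  # about 0.001
```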

&lt;p&gt;So the maximum run of leading zeros is a surprisingly good estimator of the order of magnitude of distinct elements seen. The key insight is that we only need to store one number (the current maximum) regardless of how many elements we’ve seen.&lt;/p&gt;

&lt;p&gt;Here’s the simplest possible version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# util.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_bits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bit_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Count leading zero bits in a hash value.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hash_bits&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;bit_length&lt;/span&gt;
    &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bit_length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_bits&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;count&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Hash an element to a 32-bit integer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Take first 32 bits
&lt;/span&gt;
&lt;span class="c1"&gt;# naive.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;util&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash_element&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NaiveEstimator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_leading_zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_leading_zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_leading_zeros&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;naive&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NaiveEstimator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;random_id&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascii_lowercase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;naive_estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NaiveEstimator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;true_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;naive_estimator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;naive_estimator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[naive] &lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;Estimated count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[naive] &lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this a few times gives results like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True count: 100000
[naive] Estimated count: 32768 ← 2^15
[naive] Error: 67.23%

True count: 100000
[naive] Estimated count: 65536 ← 2^16
[naive] Error: 34.46%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s correct to within an order of magnitude, consistently. But notice the problem: the estimate can only ever be a power of 2. It jumps from 32,768 to 65,536 with no values in between. The estimate also has extremely high variance: a single unlucky hash producing extra leading zeros throws everything off. This is the core idea of the Flajolet-Martin algorithm from 1984. It works, but it’s rough.&lt;/p&gt;

&lt;h3&gt;Reducing Variance: Buckets&lt;/h3&gt;

&lt;p&gt;The solution to high variance is the same as it always is in statistics: take more samples and average them.&lt;/p&gt;

&lt;p&gt;One approach: run multiple independent hash functions and average the results. But hashing everything multiple times is expensive.&lt;/p&gt;

&lt;p&gt;A smarter approach, from Flajolet and Durand’s 2003 LogLog paper: use a single hash, but split it into two parts. Use the first b bits to choose a bucket (one of m = 2^b buckets), and run the leading-zero counter on the &lt;em&gt;remaining&lt;/em&gt; bits.&lt;/p&gt;
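&lt;p&gt;As a small illustration of the split, the sketch below derives a 32-bit value from MD5 the same way hash_element does and peels off the top b bits with divmod, which is equivalent to shifting and masking. The name split_hash and the sample key "user-42" are assumptions made for the example.&lt;/p&gt;

```python
import hashlib


def split_hash(element, b=8, hash_bits=32):
    """Return (bucket_index, remaining_bits) for a 32-bit MD5-derived hash."""
    h = int(hashlib.md5(element.encode()).hexdigest()[:8], 16)
    # Dividing by 2^(hash_bits - b) leaves the top b bits as the quotient
    # and the low (hash_bits - b) bits as the remainder.
    bucket_index, remaining = divmod(h, 2 ** (hash_bits - b))
    return bucket_index, remaining


bucket, rest = split_hash("user-42")  # hypothetical element
print(f"bucket {bucket} of 256, remaining bits {rest:024b}")
```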

&lt;p&gt;Each bucket independently estimates the cardinality from the subset of elements that landed in it. We then combine the buckets by averaging. We’ve effectively gotten m independent estimates from a single hash function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# loglog.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;util&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash_element&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LogLogEstimator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        b: number of bits used for bucket index
        m: number of buckets = 2^b
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# First b bits → bucket index
&lt;/span&gt;        &lt;span class="n"&gt;bucket_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Remaining 32-b bits → count leading zeros
&lt;/span&gt;        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;leading_zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Keep the maximum for this bucket
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bucket_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bucket_index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;leading_zeros&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# LogLog: geometric mean of 2^bucket across buckets.
&lt;/span&gt;        &lt;span class="c1"&gt;# 0.79402 = 2 * 0.39701: the paper's asymptotic bias constant,
&lt;/span&gt;        &lt;span class="c1"&gt;# doubled because we count leading zeros rather than zeros + 1.
&lt;/span&gt;        &lt;span class="n"&gt;avg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.79402&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With b = 8, we have 256 buckets, each storing one small integer (at most 24, since 8 of the 32 hash bits go to the bucket index). At one byte per bucket, total memory is 256 bytes. And the estimates are dramatically more stable than the naive version. The standard error of LogLog is about 1.3 / sqrt(m); with 256 buckets, that's about 8%.&lt;/p&gt;
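&lt;p&gt;The trade-off between bucket count and accuracy is easy to tabulate. This sketch just evaluates the 1.3 / sqrt(m) rule of thumb for a few values of b; the helper name loglog_std_error is invented for the example.&lt;/p&gt;

```python
import math


def loglog_std_error(b):
    """Approximate LogLog standard error, 1.30 / sqrt(m), for m = 2^b buckets."""
    return 1.30 / math.sqrt(2 ** b)


for b in (4, 8, 10, 12):
    m = 2 ** b
    print(f"b={b:2d}  m={m:5d}  std error ~ {loglog_std_error(b) * 100:.1f}%")
```

&lt;p&gt;Each time the number of buckets quadruples, the error halves.&lt;/p&gt;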

&lt;p&gt;But we can do better. In the same 2003 paper, Durand and Flajolet observed that the largest bucket values are outliers that inflate the estimate. They suggested keeping only the bottom 70% of bucket values for the average. This is &lt;strong&gt;SuperLogLog&lt;/strong&gt;, and it reduces the error to 1.05 / sqrt(m), about 6.5% with 256 buckets, with no memory increase.&lt;/p&gt;
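&lt;p&gt;Here is a minimal sketch of the truncation step only. It is not a full SuperLogLog estimator, which would also need its own bias constant; the function name truncated_mean and the bucket values are made up for the illustration.&lt;/p&gt;

```python
def truncated_mean(buckets, keep_percent=70):
    """Mean of the smallest keep_percent% of bucket values (truncation step only)."""
    cutoff = max(1, len(buckets) * keep_percent // 100)
    kept = sorted(buckets)[:cutoff]   # drop the largest values, which are the outliers
    return sum(kept) / len(kept)


buckets = [3, 4, 4, 5, 5, 5, 6, 6, 7, 15]  # hypothetical values; 15 is the outlier
print(f"plain mean:     {sum(buckets) / len(buckets):.2f}")
print(f"truncated mean: {truncated_mean(buckets):.2f}")
```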

&lt;p&gt;Then came 2007.&lt;/p&gt;

&lt;h3&gt;The Harmonic Mean: The HyperLogLog Insight&lt;/h3&gt;

&lt;p&gt;The jump from SuperLogLog to HyperLogLog is a single change: replace the geometric mean with the &lt;strong&gt;harmonic mean&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The harmonic mean of a set of values is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n / (1/v1 + 1/v2 + ... + 1/vn)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why does this help? The harmonic mean is less sensitive to large outliers than the geometric mean. Suppose one bucket has seen an unusually large leading-zero count because a single element happened to hash to something that starts with twelve zeros. That bucket’s term in the sum of reciprocals is 1/2^12, which is tiny. It barely moves the needle. The geometric mean (averaging the exponents) would give it far more weight.&lt;/p&gt;
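&lt;p&gt;A quick numeric comparison makes the difference concrete. The sketch below uses sixteen hypothetical buckets that all hold 2^4, then swaps one of them for a 2^12 outlier; the helper functions are written out by hand for clarity.&lt;/p&gt;

```python
import math


def geometric_mean(values):
    """exp of the mean log, which is the same as 2^(mean exponent) for powers of 2."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


baseline = [2 ** 4] * 16                    # every bucket saw 4 leading zeros
with_outlier = [2 ** 4] * 15 + [2 ** 12]    # one bucket saw twelve

print(f"geometric: {geometric_mean(baseline):.1f} to {geometric_mean(with_outlier):.1f}")  # 16.0 to about 22.6
print(f"harmonic:  {harmonic_mean(baseline):.1f} to {harmonic_mean(with_outlier):.1f}")    # 16.0 to about 17.1
```

&lt;p&gt;One outlier pushes the geometric mean up by roughly 40%, while the harmonic mean moves by under 7%.&lt;/p&gt;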

&lt;p&gt;The HyperLogLog estimate is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimate = correction_factor * m^2 / sum(2^(-bucket[i]) for all i)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where correction_factor is a constant that depends on m (approximately 0.7213 for large m). The whole formula is just the harmonic mean of the 2^bucket[i] values, scaled by correction_factor * m.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hyperloglog.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;util&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hash_element&lt;/span&gt;

&lt;span class="c1"&gt;# Correction factors per the original paper
&lt;/span&gt;&lt;span class="n"&gt;CORRECTION_FACTORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.673&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.697&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.709&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        b: number of index bits (between 4 and 16)
        m = 2^b buckets
        Standard error ≈ 1.04 / sqrt(m)
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b must be between 4 and 16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;CORRECTION_FACTORS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CORRECTION_FACTORS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# for m &amp;gt;= 128
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7213&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;1.079&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash_element&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# First b bits → bucket index
&lt;/span&gt;        &lt;span class="n"&gt;bucket_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Remaining bits → position of leftmost 1-bit
&lt;/span&gt;        &lt;span class="c1"&gt;# (= 1 + number of leading zeros in remaining bits)
&lt;/span&gt;        &lt;span class="n"&gt;remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rho&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;count_leading_zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bucket_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bucket_index&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;rho&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Harmonic mean of 2^bucket values
&lt;/span&gt;        &lt;span class="n"&gt;harmonic_sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;raw_estimate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;alpha&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;harmonic_sum&lt;/span&gt;

        &lt;span class="c1"&gt;# Small range correction: use LinearCounting when estimate is small
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw_estimate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;zeros&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Large range correction (for 32-bit hashes)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;raw_estimate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;raw_estimate&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;raw_estimate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two corrections are applied at the edges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Small range correction&lt;/strong&gt; : When the estimate is less than 2.5x the number of buckets, many buckets are still empty (zero). In that regime, a separate algorithm called LinearCounting is more accurate. LinearCounting uses the number of empty buckets: m * ln(m / empty_buckets). HyperLogLog switches to it automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large range correction&lt;/strong&gt; : When the estimate grows past 1/30 of the 32-bit hash space (2^32 / 30, roughly 143 million), hash collisions start causing a systematic undercount that worsens as the estimate approaches the 4 billion maximum. A logarithmic correction compensates.&lt;/p&gt;
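
&lt;p&gt;As a rough worked example of the two formulas, here is a standalone sketch. It is not part of the class above, and the function names linear_counting and large_range_correction are made up for illustration:&lt;/p&gt;

```python
import math


def linear_counting(m, empty_buckets):
    """Small-range estimate: m * ln(m / empty_buckets)."""
    if empty_buckets == 0:
        raise ValueError("LinearCounting needs at least one empty bucket")
    return m * math.log(m / empty_buckets)


def large_range_correction(raw_estimate, hash_bits=32):
    """Compensate for collisions as the estimate nears 2**hash_bits."""
    space = 2 ** hash_bits
    return -space * math.log(1 - raw_estimate / space)


# 1024 buckets with half of them still empty: 1024 * ln(2), roughly 710
print(round(linear_counting(1024, 512)))
# A raw estimate of one billion is nudged upward by the correction
print(round(large_range_correction(1_000_000_000)))
```

&lt;p&gt;With no empty buckets left the LinearCounting formula is undefined, which is why the class falls back to the raw estimate in that case.&lt;/p&gt;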

&lt;p&gt;Now let’s test this properly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;string&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;hyperloglog&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HyperLogLog&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;random_id&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ascii_lowercase&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;hyper_log_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# 1024 buckets, ~1KB
&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;elem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;random_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;hyper_log_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;elem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;True count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hyper_log_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;estimate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;true_set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[hyper log log] &lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;Estimated count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;est&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[hyper log log] &lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[hyper log log] &lt;/span&gt;&lt;span class="se"&gt;\t&lt;/span&gt;&lt;span class="s"&gt;Memory (approx): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hyper_log_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes for buckets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;True count: 100000

[hyper log log] Estimated count: 97,291
[hyper log log] Error: 2.71%
[hyper log log] Memory (approx): 1024 bytes for buckets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1024 bytes. About 2.7% error in this particular run. On 100,000 unique elements.&lt;/p&gt;

&lt;p&gt;With b = 10 (1024 buckets), the theoretical standard error is 1.04 / sqrt(1024) = 3.25%. In practice it often lands well within that. With b = 12 (4096 buckets, still only 4KB), the standard error drops to 1.04 / sqrt(4096) = 1.6%.&lt;/p&gt;
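
&lt;p&gt;The 1.04 / sqrt(m) rule is easy to tabulate for a few bucket counts. A minimal sketch, where standard_error is a helper name chosen here and not something from the class above:&lt;/p&gt;

```python
import math


def standard_error(b):
    """Theoretical relative standard error for 2**b buckets."""
    return 1.04 / math.sqrt(2 ** b)


for b in (10, 12, 14):
    buckets = 2 ** b
    print(f"b={b}: {buckets} buckets, {standard_error(b) * 100:.2f}% standard error")
```

&lt;p&gt;Each extra two bits of index quadruples the bucket count and halves the error.&lt;/p&gt;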

&lt;h3&gt;
  
  
  The Memory Arithmetic
&lt;/h3&gt;

&lt;p&gt;Let’s be precise about why this is so small.&lt;/p&gt;

&lt;p&gt;Each bucket stores the largest rho it has seen: the leading-zero count of the remaining bits, plus one. With a 32-bit hash and b bits used for the index, the remaining 32 - b bits feed that count, so the maximum possible value is 32 - b + 1, which for b = 10 is 23. You need 5 bits to represent numbers up to 23.&lt;/p&gt;

&lt;p&gt;So each bucket needs 5 bits. With 1024 buckets: 1024 * 5 = 5120 bits = 640 bytes.&lt;/p&gt;

&lt;p&gt;In practice Redis uses 6 bits per register and 16,384 registers (b = 14): 16384 * 6 = 98,304 bits = 12,288 bytes = 12KB. With this configuration the standard error is 1.04 / sqrt(16384) = 0.81%. Under 1% error, for a 12KB data structure, counting up to billions of distinct elements.&lt;/p&gt;

&lt;p&gt;Redis also uses a dense/sparse encoding: when few elements have been added, only the non-zero buckets are stored. The 12KB limit is only reached with a large number of distinct elements.&lt;/p&gt;
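
&lt;p&gt;The arithmetic above can be checked in a few lines. This is only a sketch of the packed (dense) size; the helper names are made up here, and it counts rho as leading zeros plus one, the way the add method does:&lt;/p&gt;

```python
def max_rank(hash_bits, b):
    """Largest value a register can hold: leading zeros plus one."""
    return hash_bits - b + 1


def bits_needed(value):
    """Number of bits required to store integers from 0 to value."""
    return value.bit_length()


def sketch_bytes(b, bits_per_register):
    """Packed size of 2**b registers, rounded up to whole bytes."""
    total_bits = (2 ** b) * bits_per_register
    return (total_bits + 7) // 8


print(bits_needed(max_rank(32, 10)))   # 5 bits per register
print(sketch_bytes(10, 5))             # 640 bytes for 1024 registers
print(sketch_bytes(14, 6))             # 12288 bytes for the Redis layout
```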

&lt;h3&gt;
  
  
  The Property That Makes It Production-Useful
&lt;/h3&gt;

&lt;p&gt;There’s one more thing that makes HyperLogLog more than just a clever approximation: HyperLogLog sketches are &lt;strong&gt;mergeable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you have a HyperLogLog for Monday’s traffic and one for Tuesday’s traffic, you can merge them into a HyperLogLog for Monday+Tuesday’s combined unique users by taking the element-wise maximum of the two bucket arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hll1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hll2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;hll1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;hll2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Can only merge HLLs with same b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HyperLogLog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hll1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;b1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b2&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hll1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hll2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;merged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes O(m) time and produces an estimate with the same error guarantee as a fresh HyperLogLog that had seen both datasets. No re-processing. No storing the original data.&lt;/p&gt;
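
&lt;p&gt;A quick way to convince yourself of this is to compare a merged sketch against one built from the combined data. The sketch below is self-contained and deliberately simplified: it works on bare register lists, and it uses the first 32 bits of SHA-1 as the hash, which is an assumption made here and not necessarily what hash_element does:&lt;/p&gt;

```python
import hashlib


def build_registers(items, b=10):
    """Fill 2**b registers from an iterable of strings."""
    width = 32 - b                      # bits left after the bucket index
    registers = [0] * (2 ** b)
    for item in items:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")   # 32-bit hash (assumption)
        index = h // (2 ** width)               # first b bits
        remaining = h % (2 ** width)            # the other 32 - b bits
        rho = width - remaining.bit_length() + 1
        registers[index] = max(registers[index], rho)
    return registers


def merge_registers(left, right):
    """Element-wise maximum, the same operation merge() performs."""
    return [max(a, c) for a, c in zip(left, right)]


monday = [f"user-{i}" for i in range(0, 6000)]
tuesday = [f"user-{i}" for i in range(4000, 10000)]   # overlaps with Monday

merged = merge_registers(build_registers(monday), build_registers(tuesday))
combined = build_registers(monday + tuesday)
print(merged == combined)   # True: merging loses nothing
```

&lt;p&gt;The registers come out identical because each one is just a running maximum, and a maximum does not care about the order or grouping of its inputs.&lt;/p&gt;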

&lt;p&gt;This is why it appears in distributed systems everywhere. You can compute HyperLogLog sketches independently on shards, machines, or time windows, then merge them instantly. Reddit uses this for per-post unique view counts distributed across servers. BigQuery uses it for APPROX_COUNT_DISTINCT(). Facebook Presto, Apache Druid, Amazon Redshift - they all have it.&lt;/p&gt;

&lt;p&gt;In Redis, PFADD adds elements, PFCOUNT returns the estimate, and PFMERGE merges two or more HLLs into one. The PF prefix is a tribute to Philippe Flajolet, who died in 2011, four years after publishing the paper that named the algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Actually Took From This
&lt;/h3&gt;

&lt;p&gt;There’s a category of algorithms where the core idea is so simple that it seems like it can’t possibly work, and the whole experience of learning it is going from skepticism to surprise to understanding. Counting distinct elements by tracking coin-flip statistics is in that category.&lt;/p&gt;

&lt;p&gt;The thing I keep thinking about is the memory arithmetic. A set() in Python holding 100,000 integers uses roughly 4MB. The HyperLogLog above used 1KB and got within 3%. The set uses 4,000x more memory for exact precision. Whether that precision is worth 4,000x the memory depends entirely on what you’re building.&lt;/p&gt;

&lt;p&gt;For a unique user counter that needs to answer in real time, served from Redis, on billions of events, it’s not worth it. For a billing system that needs to charge per unique user to the cent, it might be. Knowing which situation you’re in is the actual engineering skill. HyperLogLog is just a tool. A very elegant one.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>algorithms</category>
      <category>backend</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>I Had to Run a Blockchain on My Laptop, So I Put It in Kubernetes</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Wed, 06 May 2026 06:38:06 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/i-had-to-run-a-blockchain-on-my-laptop-so-i-put-it-in-kubernetes-2d0j</link>
      <guid>https://dev.to/yuvrajraghuvanshis/i-had-to-run-a-blockchain-on-my-laptop-so-i-put-it-in-kubernetes-2d0j</guid>
      <description>&lt;p&gt;The assignment was to build a ticket booking system using Hyperledger Fabric. Two entities (travel agencies and customers), a shared ledger, and a requirement that every booking be verifiable on the blockchain. We had weeks to do it.&lt;/p&gt;

&lt;p&gt;I did not plan to spend the first few weeks fighting infrastructure before writing a single line of business logic.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Hyperledger Fabric Actually Is
&lt;/h3&gt;

&lt;p&gt;Before I get into what went wrong, it’s worth explaining what Hyperledger Fabric is, because it’s quite different from what most people imagine when they hear “blockchain.”&lt;/p&gt;

&lt;p&gt;When people think of blockchain, they usually think of Bitcoin or Ethereum: a public network anyone can join, where transactions are anonymous, and where consensus is reached through computational work. Hyperledger Fabric is none of those things. It’s a &lt;em&gt;permissioned&lt;/em&gt; blockchain framework — every participant must be explicitly identified and credentialed before they can interact with the network. There are no anonymous transactions. There is no mining.&lt;/p&gt;

&lt;p&gt;Fabric’s target audience is enterprises and consortiums. Think of a group of banks that want to record inter-bank settlements on a shared ledger, or airlines and travel agencies that want a single source of truth for ticket inventory. Each of those organizations runs their own nodes, retains control of their own data, and collectively agrees on what gets written to the ledger. No one organization controls the chain.&lt;/p&gt;

&lt;p&gt;The core building blocks of a Fabric network are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Peers&lt;/strong&gt; : The actual nodes that store a copy of the ledger and execute the smart contracts (called “chaincode” in Fabric). Each organization in the network runs one or more peers. If you have two organizations, you have at least two sets of peers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orderers&lt;/strong&gt; : A separate cluster of nodes whose only job is to sequence transactions and package them into blocks. Peers don’t talk to each other to agree on order; they send transactions to the orderers, who handle that. The orderers use a consensus algorithm called Raft — the same one used in databases like etcd — where one node is elected leader and the others follow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate Authorities (CAs)&lt;/strong&gt;: Since every participant must be credentialed, Fabric runs a CA for each organization. These issue the cryptographic identities (X.509 certificates) that peers, orderers, and users present when making any request. No valid certificate, no access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Channels&lt;/strong&gt; : A Fabric network can have multiple independent sub-ledgers called channels. Each channel has its own blockchain, its own set of members, and its own chaincode. In this project there’s one channel: mychannel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chaincode&lt;/strong&gt; : The smart contracts. These are programs that run on the peers and define what operations can be performed on the ledger. In Fabric, chaincode is written in a real programming language (Go, Java, TypeScript). When a client wants to record a booking, it calls a chaincode function. The chaincode executes on the peer, validates the inputs, and writes to the ledger.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;World State&lt;/strong&gt; : The current state of all data, stored in a database (CouchDB in this project). When chaincode writes data, it goes into the world state. The blockchain itself records the &lt;em&gt;history&lt;/em&gt; of every transaction; the world state is the up-to-date snapshot.&lt;/p&gt;

&lt;p&gt;For this project, the network has three organizations. Org0 runs the ordering service: three orderer nodes using Raft consensus. Org1 represents travel agencies. Org2 represents customers. Org1 and Org2 each run two peers and one CA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw62d98xzilig7927g1jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw62d98xzilig7927g1jb.png" alt="Diagram: Three boxes labeled Org0 (3 orderers), Org1 (2 peers + CA), Org2 (2 peers + CA), connected by a channel labeled mychannel" width="800" height="588"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Diagram: Three boxes labeled Org0 (3 orderers), Org1 (2 peers + CA), Org2 (2 peers + CA), connected by a channel labeled mychannel&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Problem With “Simple”
&lt;/h3&gt;

&lt;p&gt;Fabric ships with several example networks. The simplest is a Docker Compose setup that brings up a few containers on your local machine. I started there.&lt;/p&gt;

&lt;p&gt;It didn’t connect reliably. Peers couldn’t reach each other. The REST API sample couldn’t find the ledger. I tried the JavaScript version. Same issues. The Docker Compose approach works fine if you follow the tutorial exactly on a clean machine with the right Fabric binary versions. In practice, when you’re trying to connect your own code to it rather than running the provided samples, small mismatches in TLS configuration or service discovery cause silent failures that are difficult to trace.&lt;/p&gt;

&lt;p&gt;The Kubernetes-based test network (test-network-k8s in the fabric-samples repository) was the only variant that worked consistently. And once I committed to it, it solved a second problem I hadn't fully thought through: I needed to run a lot of things simultaneously. There was a customer backend, a travel agency backend, a unified frontend, the Fabric REST interface, and the Fabric network itself - eight-plus processes. Kubernetes gave me a way to run all of that in one KIND cluster (KIND is "Kubernetes IN Docker" - it runs a full Kubernetes cluster inside Docker containers on your laptop) without manually managing ports, docker networks, and process restarts.&lt;/p&gt;

&lt;p&gt;So the choice wasn’t ideological. It was pragmatic: Kubernetes was what worked, and it handled the orchestration problem for free.&lt;/p&gt;
&lt;h3&gt;
  
  
  What the Kubernetes Deployment Actually Looks Like
&lt;/h3&gt;

&lt;p&gt;Every component in this network is a Kubernetes resource. Let me show what that means concretely with the peer deployment.&lt;/p&gt;

&lt;p&gt;Each peer has three Kubernetes resources: a Certificate (for TLS), a ConfigMap (for environment config), and a Deployment that runs the actual container. Here's the ConfigMap for org1-peer1, which shows how the peer is configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer1-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;CORE_PEER_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer1.org1.example.com&lt;/span&gt;
  &lt;span class="na"&gt;CORE_PEER_ADDRESS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer1:7051&lt;/span&gt;
  &lt;span class="na"&gt;CORE_PEER_LOCALMSPID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Org1MSP&lt;/span&gt;
  &lt;span class="na"&gt;CORE_PEER_MSPCONFIGPATH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/hyperledger/fabric/organizations/...&lt;/span&gt;
  &lt;span class="na"&gt;CORE_PEER_GOSSIP_BOOTSTRAP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer2:7051&lt;/span&gt;
  &lt;span class="na"&gt;CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"peername":"org1peer1"}'&lt;/span&gt;
  &lt;span class="na"&gt;CORE_LEDGER_STATE_STATEDATABASE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CouchDB&lt;/span&gt;
  &lt;span class="na"&gt;CORE_LEDGER_STATE_COUCHDBCONFIG_COUCHDBADDRESS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;localhost:5984&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;CORE_PEER_GOSSIP_BOOTSTRAP tells this peer to connect to org1-peer2 when it starts, for gossip, the protocol peers use to share ledger state with each other. CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG gives the peer the name of &lt;em&gt;this specific peer&lt;/em&gt; so the chaincode deployment knows which sidecar belongs to which peer. The CouchDB settings are there because the world state for this project is stored in CouchDB rather than the default LevelDB, which allows richer queries.&lt;/p&gt;

&lt;p&gt;The Deployment spec itself is interesting because it runs &lt;em&gt;two&lt;/em&gt; containers in the same pod:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${FABRIC_PEER_IMAGE}&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7051&lt;/span&gt; &lt;span class="c1"&gt;# gRPC for clients&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7052&lt;/span&gt; &lt;span class="c1"&gt;# gRPC for chaincode&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9443&lt;/span&gt; &lt;span class="c1"&gt;# operations/health&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;couchdb&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;couchdb:${COUCHDB_VERSION}&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COUCHDB_USER&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;COUCHDB_PASSWORD&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;adminpw&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5984&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The peer and its CouchDB instance are co-located in the same pod. The peer container reaches CouchDB at localhost:5984 because containers in the same pod share a network namespace. This is the standard Kubernetes sidecar pattern.&lt;/p&gt;

&lt;p&gt;The TLS certificate for the peer is handled by cert-manager, a Kubernetes add-on that automates certificate issuance. Each peer gets a certificate with multiple DNS names:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cert-manager.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Certificate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer1-tls-cert&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dnsNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;localhost&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;org1-peer1&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;org1-peer1.test-network.svc.cluster.local&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;org1-peer1.localho.st&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;org1-peer-gateway-svc&lt;/span&gt;
  &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-peer1-tls-cert&lt;/span&gt;
  &lt;span class="na"&gt;issuerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1-tls-cert-issuer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The certificate needs to cover all the names by which clients might reach this peer: inside the cluster, outside the cluster via ingress, and through the gateway service. TLS validation will reject a connection if the hostname the client is connecting to doesn’t match a name in the certificate. This matters because a single TLS handshake failure tends to surface as an opaque error that looks like a refused connection.&lt;/p&gt;
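
&lt;p&gt;To make the hostname rule concrete, here is a minimal sketch of the kind of check a TLS client performs against the certificate's DNS names. The &lt;code&gt;isCoveredByCertificate&lt;/code&gt; helper is hypothetical and deliberately simplified (exact, case-insensitive matching only); real TLS libraries do this for you and also handle wildcards and IP addresses.&lt;/p&gt;

```typescript
// Illustrative only: a simplified version of the hostname check a TLS client performs.
// The names are the dnsNames from the Certificate resource above.
const dnsNames = [
  "localhost",
  "org1-peer1",
  "org1-peer1.test-network.svc.cluster.local",
  "org1-peer1.localho.st",
  "org1-peer-gateway-svc",
];

// Hypothetical helper: exact, case-insensitive match against the certificate's DNS names.
function isCoveredByCertificate(host: string, names: string[]): boolean {
  const wanted = host.toLowerCase();
  return names.some((name) => name.toLowerCase() === wanted);
}

console.log(isCoveredByCertificate("org1-peer1", dnsNames)); // true
// The namespace-qualified short form is not in the list, so it would be rejected.
console.log(isCoveredByCertificate("org1-peer1.test-network", dnsNames)); // false
```

&lt;p&gt;If a client dials the peer under a name that is not listed, such as the second case here, the handshake fails even though the pod itself is healthy.&lt;/p&gt;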

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l1jijrcznd4s3kzs0ac.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7l1jijrcznd4s3kzs0ac.png" alt="Screenshot: kubectl -n test-network get pods — showing the full list of running pods" width="800" height="352"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: kubectl -n test-network get pods — showing the full list of running pods&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Chaincode as a Service: Why the Chaincode Runs as Its Own Pod
&lt;/h3&gt;

&lt;p&gt;In traditional Fabric deployments, the peer launches chaincode directly using Docker: when you install chaincode on a peer, the peer builds a Docker image and spins up a container. This is problematic in Kubernetes because it requires the peer container to have access to a Docker daemon (Docker-in-Docker), which is complex to set up and generally discouraged for security reasons.&lt;/p&gt;

&lt;p&gt;The solution is Chaincode as a Service (CCaaS). Instead of the peer spawning the chaincode, the chaincode runs as its own Kubernetes Deployment and exposes a gRPC server on port 9999. The peer connects to it at a known address. The chaincode is just another pod in the cluster.&lt;/p&gt;

&lt;p&gt;The address is specified in a connection.json file that gets bundled into the chaincode package before deployment:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{.peername}}-ccaas-chaincode:9999"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dial_timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tls_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;code&gt;{{.peername}}&lt;/code&gt; placeholder is filled in from the peername value in CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG, so for org1peer1 the address becomes org1peer1-ccaas-chaincode:9999. Each peer then knows exactly which Kubernetes service to connect to.&lt;/p&gt;
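
&lt;p&gt;The substitution can be sketched in a few lines of TypeScript. This is not Fabric's implementation; &lt;code&gt;renderTemplate&lt;/code&gt; is a hypothetical helper that only imitates simple &lt;code&gt;{{.key}}&lt;/code&gt; replacement, fed with the same JSON shape as CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG.&lt;/p&gt;

```typescript
// Hypothetical helper imitating {{.key}} substitution on connection.json.
// It throws if the template references a key that was not supplied.
function renderTemplate(template: string, values: { [key: string]: string }): string {
  return template.replace(/\{\{\.(\w+)\}\}/g, (placeholder: string, key: string) => {
    const value = values[key];
    if (value === undefined) {
      throw new Error(`No value supplied for ${placeholder}`);
    }
    return value;
  });
}

// The connection.json shown above, as a string.
const connectionTemplate = JSON.stringify({
  address: "{{.peername}}-ccaas-chaincode:9999",
  dial_timeout: "10s",
  tls_required: false,
});

// Same shape as the CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG value in the peer's environment.
const builderConfig = JSON.parse('{"peername":"org1peer1"}');

const connection = JSON.parse(renderTemplate(connectionTemplate, builderConfig));
console.log(connection.address); // org1peer1-ccaas-chaincode:9999
```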

&lt;p&gt;The chaincode Kubernetes deployment is generated from a template, with the chaincode name, ID, and image substituted in by the deployment script:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1{{PEER_NAME}}-ccaas-{{CHAINCODE_NAME}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;CHAINCODE_IMAGE&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CHAINCODE_SERVER_ADDRESS&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:9999&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CHAINCODE_ID&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;CHAINCODE_ID&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org1{{PEER_NAME}}-ccaas-{{CHAINCODE_NAME}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;chaincode&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9999&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;One deployment and one service per peer, per org. With two peers per org and two orgs (org1 and org2), that’s four chaincode sidecar pods in total — org1peer1-ccaas-chaincode, org1peer2-ccaas-chaincode, org2peer1-ccaas-chaincode, org2peer2-ccaas-chaincode.&lt;/p&gt;
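
&lt;p&gt;As a quick check on that count, the names can be enumerated with a small loop. The &lt;code&gt;chaincodeServiceNames&lt;/code&gt; helper is hypothetical, and the chaincode name "chaincode" is assumed from the names listed above.&lt;/p&gt;

```typescript
// Hypothetical helper: one chaincode deployment/service name per peer, per org.
function chaincodeServiceNames(orgNames: string[], peerNames: string[], ccName: string): string[] {
  const names: string[] = [];
  for (const org of orgNames) {
    for (const peer of peerNames) {
      names.push(`${org}${peer}-ccaas-${ccName}`);
    }
  }
  return names;
}

// Two orgs with two peers each; "chaincode" is the assumed chaincode name.
const serviceNames = chaincodeServiceNames(["org1", "org2"], ["peer1", "peer2"], "chaincode");
console.log(serviceNames.length); // 4
console.log(serviceNames.join(", "));
```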

&lt;p&gt;The CHAINCODE_ID is the critical environment variable. It's derived from sha256(chaincode.tgz), the hash of the packaged chaincode archive (Fabric's package ID takes the form label:hash). The peer and the chaincode container must agree on this value; if they don't match, the peer refuses to talk to the chaincode container.&lt;/p&gt;
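
&lt;p&gt;The hashing step itself is ordinary SHA-256. Below is a minimal sketch using Node's built-in &lt;code&gt;crypto&lt;/code&gt; module; &lt;code&gt;sha256Hex&lt;/code&gt; is a hypothetical helper, and the input is stand-in bytes rather than a real chaincode.tgz.&lt;/p&gt;

```typescript
import { createHash } from "crypto";

// Hypothetical helper: hex-encoded SHA-256 of a package's bytes.
function sha256Hex(packageBytes: Uint8Array | string): string {
  return createHash("sha256").update(packageBytes).digest("hex");
}

// Stand-in input; in practice you would pass the contents of chaincode.tgz
// (for example, the Buffer returned by fs.readFileSync).
console.log(sha256Hex("abc"));
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

&lt;p&gt;Any change to the archive, even a single byte, produces a different hash, which is why a rebuilt package needs its CHAINCODE_ID updated in the deployment.&lt;/p&gt;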
&lt;h3&gt;
  
  
  Writing the Chaincode: What fabric-contract-api Is
&lt;/h3&gt;

&lt;p&gt;The chaincode is written in TypeScript using a library called fabric-contract-api. Before getting into what this does, it helps to understand the problem it's solving.&lt;/p&gt;

&lt;p&gt;Fabric chaincode communicates with the peer over gRPC — a low-level binary protocol. Without a framework, you’d be implementing the gRPC server yourself, handling message serialization, managing the chaincode lifecycle protocol, and making raw putState and getState calls. It's doable but tedious and error-prone.&lt;/p&gt;

&lt;p&gt;fabric-contract-api wraps all of that. It lets you write a TypeScript class where each method is a smart contract function. You decorate classes with @Object and @Info, properties with @Property, and methods with @Transaction, and the framework handles the gRPC plumbing, serialization, and lifecycle.&lt;/p&gt;

&lt;p&gt;Here is what the booking data model looks like with these decorators:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Property&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fabric-contract-api&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Booking&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;First Booking&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;First User&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;userHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;isUserAnonymous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;agencyID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;travelID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;seatNumbers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1A,1B&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;totalPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;transactionID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Confirmed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;updatedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;cancelledAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;refundAmount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;penalty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;availableSeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Property&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The @Object() decorator registers this class as a data type in the contract's metadata, and the @Property() decorators describe its fields there. Serialization itself is explicit: when you call ctx.stub.putState(bookingID, Buffer.from(JSON.stringify(booking))), JSON.stringify turns the object into JSON and the result is written to the world state under the bookingID key. Reading it back is just ctx.stub.getState(bookingID) and parsing the result.&lt;/p&gt;
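
&lt;p&gt;The write-then-read round trip can be tried without a Fabric network. The sketch below uses a hypothetical &lt;code&gt;InMemoryStub&lt;/code&gt; that stands in for ctx.stub and keeps bytes in a Map; it is an illustration of the JSON round trip, not the real stub, and the booking fields are cut down to three for brevity.&lt;/p&gt;

```typescript
// Hypothetical in-memory stand-in for ctx.stub, for illustration only.
// The real stub talks to the peer; this one just keeps bytes in a Map.
class InMemoryStub {
  private store = new Map();

  async putState(key: string, value: Buffer) {
    this.store.set(key, value);
  }

  async getState(key: string) {
    // Return empty bytes for a missing key, matching the
    // data.length === 0 check used in this article's contract.
    return this.store.has(key) ? (this.store.get(key) as Buffer) : Buffer.alloc(0);
  }
}

async function roundTrip() {
  const stub = new InMemoryStub();
  const booking = { bookingID: "B-1", status: "Confirmed", totalPrice: 100 };

  // Write: object to JSON string to bytes.
  await stub.putState(booking.bookingID, Buffer.from(JSON.stringify(booking)));

  // Read: bytes to JSON string to object.
  const stored = await stub.getState("B-1");
  const parsed = JSON.parse(stored.toString());
  const missing = await stub.getState("does-not-exist");
  return { parsed, missingLength: missing.length };
}

roundTrip().then((result) => console.log(result));
```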

&lt;p&gt;The contract itself uses @Transaction() to mark functions that write to the ledger (submit transactions) and @Transaction(false) for read-only queries (evaluate transactions):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Contract&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Transaction&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fabric-contract-api&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;BookingContract&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Smart contract for recording travel bookings&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BookingContract&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Contract&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nc"&gt;RecordBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;userID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;isUserAnonymous&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ... other fields&lt;/span&gt;
  &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;booking&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Booking&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ... assign fields&lt;/span&gt;
    &lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTxID&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;putState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;booking&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nc"&gt;ReadBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Booking &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; does not exist`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nc"&gt;BookingExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nc"&gt;GetAllBookings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iterator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getStateByRange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;bookings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;iterator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;bookings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;()));&lt;/span&gt;
      &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;iterator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;iterator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bookings&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nc"&gt;DeleteBooking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BookingExists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`The booking &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; does not exist`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stub&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deleteState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bookingID&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The ctx.stub.getTxID() call inside RecordBooking is important. Every Fabric transaction carries a unique transaction ID, generated by the client when it creates the proposal and visible to the chaincode while it executes. By capturing this inside the chaincode and storing it as hyperledgerTxId in the booking record, we can later look up exactly which block this booking is in. That's what the block height endpoint does.&lt;/p&gt;

&lt;p&gt;GetAllBookings uses getStateByRange('', '') with empty strings for both bounds - that means "all keys." It returns a cursor-based iterator rather than loading everything at once, which matters if the ledger grows large.&lt;/p&gt;
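
&lt;p&gt;One detail the loop in GetAllBookings glosses over: if JSON.parse throws on a malformed record, iterator.close() is never reached. Below is a minimal sketch of the same drain loop with the close moved into a finally block. FakeIterator is a hypothetical in-memory stand-in for the object getStateByRange returns (it yields strings where the real iterator yields Buffers), so the snippet runs without a Fabric network.&lt;/p&gt;

```typescript
// Hypothetical stand-in for a Fabric state iterator. Each record is exposed
// as { value: ... }, mirroring the result.value.value access in the chaincode.
class FakeIterator {
  closed = false;
  private index = 0;
  private records: string[];

  constructor(records: string[]) {
    this.records = records;
  }

  async next() {
    const done = this.index >= this.records.length;
    const value = done ? undefined : { value: this.records[this.index++] };
    return { done, value };
  }

  async close() {
    this.closed = true;
  }
}

// Reads every record, and closes the iterator even if parsing fails midway.
async function drain(iterator: FakeIterator) {
  const bookings: unknown[] = [];
  try {
    let result = await iterator.next();
    while (!result.done) {
      if (result.value !== undefined) {
        bookings.push(JSON.parse(result.value.value.toString()));
      }
      result = await iterator.next();
    }
  } finally {
    await iterator.close();
  }
  return bookings;
}

drain(new FakeIterator(['{"bookingID":"b1"}', '{"bookingID":"b2"}'])).then((bookings) => {
  console.log(bookings.length); // 2
});
```

&lt;p&gt;The only behavioural difference from the original loop is on the error path: the iterator is released before the exception propagates.&lt;/p&gt;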
&lt;h3&gt;
  
  
  The Privacy Design: Hashing User Identity
&lt;/h3&gt;

&lt;p&gt;There’s a field in the Booking object called userHash, and a boolean called isUserAnonymous. This needs explaining.&lt;/p&gt;

&lt;p&gt;The Fabric ledger in this architecture is not private — all peers in the network can read all bookings. Org1 (travel agencies) and Org2 (customers) share the same channel and therefore the same ledger. If a booking stored userName: "Yuvraj Raghuvanshi" and userEmail: "&lt;a href="mailto:yuvraj@example.com"&gt;yuvraj@example.com&lt;/a&gt;" directly in the ledger entry, then every peer operator - including the travel agencies - could read that personal information.&lt;/p&gt;

&lt;p&gt;Bitcoin has the same problem and solves it with a hash: your identity on the Bitcoin network is a hash of your public key, not your name. I used the same idea here. By default, only a hash of the user’s identifier is written to the ledger. The actual name and email stay in the customer backend’s database, which the travel agencies can’t access. If the user explicitly opts out of anonymity (isUserAnonymous: false), their internal application userID is written instead - still not their name or email, just an opaque identifier.&lt;/p&gt;

&lt;p&gt;The personal information lives in the application layer. The ledger records that &lt;em&gt;a booking was made&lt;/em&gt;. If you need to verify who made it, you go through the application, not the ledger directly.&lt;/p&gt;
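
&lt;p&gt;As a rough illustration of that rule, here is a minimal sketch of how the value written to the ledger could be chosen. The function name ledgerUserRef, the salt parameter, and the choice of SHA-256 over the internal userID are assumptions made for this example, not the exact code from the project.&lt;/p&gt;

```typescript
import { createHash } from "crypto";

// Returns the identifier that would be written to the ledger for a booking.
// Assumption: the hashed input is the internal userID plus an app-side salt.
function ledgerUserRef(userID: string, isUserAnonymous: boolean, salt: string): string {
  if (!isUserAnonymous) {
    // Opted out of anonymity: the opaque internal ID, never a name or email.
    return userID;
  }
  return createHash("sha256").update(`${salt}:${userID}`).digest("hex");
}

console.log(ledgerUserRef("user-42", true, "demo-salt").length); // 64 (hex characters)
console.log(ledgerUserRef("user-42", false, "demo-salt"));       // user-42
```

&lt;p&gt;Either way, nothing that reaches the ledger can be turned back into a name or email without going through the customer backend.&lt;/p&gt;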
&lt;h3&gt;
  
  
  The Chaincode Lifecycle: A Five-Step Process
&lt;/h3&gt;

&lt;p&gt;This is one of the parts of Fabric that confuses people most. Deploying chaincode is not like deploying a Docker image. There is a formal five-step governance process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Package&lt;/strong&gt; : Bundle the chaincode source (or in CCaaS mode, the connection.json and metadata.json) into a .tgz archive. Compute its SHA-256 hash: prefixed with the package label (label:hash), this becomes the CHAINCODE_ID.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Install&lt;/strong&gt; : Copy the package to each peer that will run the chaincode. The peer stores it locally but doesn’t activate it yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Approve&lt;/strong&gt; : Each organization’s admin issues a vote approving the chaincode definition: this name, this version, this sequence number. In a real multi-party network, each organization does this independently. The channel’s LifecycleEndorsement policy (by default, a majority of the channel’s organizations) determines how many approvals are needed before the chaincode can be committed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Commit&lt;/strong&gt; : Once enough organizations have approved, one admin commits the chaincode definition to the channel. This is a channel-wide operation: after commit, all peers on the channel recognize the chaincode as active.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Launch (CCaaS only)&lt;/strong&gt;: In CCaaS mode, after the lifecycle is complete, the chaincode container must actually be running. The peer connects to it over gRPC at the address from connection.json.&lt;/p&gt;
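
&lt;p&gt;The gate between steps 3 and 4 can be sketched as a small check. The peer CLI can report approvals with peer lifecycle chaincode checkcommitreadiness --output json, which prints an approvals map keyed by MSP ID. The helper below is a hypothetical illustration that assumes the default majority rule; a channel configured with a different lifecycle policy would need a different test.&lt;/p&gt;

```typescript
// Shape of the "approvals" map in the checkcommitreadiness JSON output.
type Approvals = { [mspId: string]: boolean };

// Assumes a majority rule: strictly more than half of the orgs must approve.
function readyToCommit(approvals: Approvals): boolean {
  const orgs = Object.keys(approvals);
  const approved = orgs.filter((org) => approvals[org]).length;
  return approved * 2 > orgs.length;
}

console.log(readyToCommit({ Org1MSP: true, Org2MSP: false })); // false
console.log(readyToCommit({ Org1MSP: true, Org2MSP: true }));  // true
```

&lt;p&gt;With only two organizations, a majority means both, which is why a single missing approval blocks the commit.&lt;/p&gt;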

&lt;p&gt;Here is the approval step in the deployment script, showing the key arguments:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;peer lifecycle chaincode approveformyorg &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--channelID&lt;/span&gt; mychannel &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; chaincode &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--package-id&lt;/span&gt; chaincode:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;sha256_of_package&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--sequence&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;next_seq_num&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--orderer&lt;/span&gt; org0-orderer1.localho.st:443 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tls&lt;/span&gt; &lt;span class="nt"&gt;--cafile&lt;/span&gt; /path/to/org0-tls-ca.pem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The --sequence argument is where I ran into a concrete problem.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Sequence Number Bug
&lt;/h3&gt;

&lt;p&gt;When you first deploy chaincode, --sequence 1 is correct. The sequence number tracks how many times the chaincode definition has been updated on the channel. First deployment: 1. First update: 2. And so on.&lt;/p&gt;

&lt;p&gt;The original chaincode.sh from fabric-samples had --sequence 1 hardcoded everywhere - in both the approveformyorg and commit commands. This works exactly once. The reset script tears down the KIND cluster and rebuilds everything from scratch, so each reset starts fresh - which means sequence 1 is always correct after a full reset.&lt;/p&gt;

&lt;p&gt;But Fabric also supports &lt;em&gt;incremental&lt;/em&gt; chaincode updates without tearing down the cluster. If you install a new version of chaincode on a running network, you need to increment the sequence. With the hardcoded 1, this would fail.&lt;/p&gt;

&lt;p&gt;The fix was to query the current committed sequence and increment it:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;get_next_sequence&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;channel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;cc_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

  export_peer_context org1 peer1

  &lt;span class="nv"&gt;current_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;peer lifecycle chaincode querycommitted &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="nv"&gt;$channel&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nv"&gt;$cc_name&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
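    &lt;span class="nt"&gt;--output&lt;/span&gt; json 2&amp;gt;/dev/null | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.sequence // 0'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
  &lt;span class="c"&gt;# Empty output (nothing committed yet) falls back to 0&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="k"&gt;$((&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;current_seq&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;0&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="k"&gt;))&lt;/span&gt;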
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If the chaincode hasn’t been committed yet, querycommitted returns nothing, we default to 0, and the next sequence is 1. If it's already been committed at sequence 1, the next sequence is 2. This makes incremental updates possible without touching the cluster.&lt;/p&gt;
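
&lt;p&gt;The same rule is easy to state outside of shell. The sketch below takes the raw stdout of peer lifecycle chaincode querycommitted --output json and returns the sequence to use next. The function name nextSequence is hypothetical, and the code accepts the sequence field as either a number or a numeric string, since the exact encoding is an assumption here.&lt;/p&gt;

```typescript
// Computes the sequence to pass to approveformyorg and commit.
function nextSequence(rawOutput: string): number {
  let current = 0; // nothing committed yet
  try {
    const parsed = JSON.parse(rawOutput);
    const seq = Number(parsed.sequence);
    if (seq >= 1) {
      current = seq;
    }
  } catch {
    // Empty or non-JSON output: keep the default of 0.
  }
  return current + 1;
}

console.log(nextSequence(""));                              // 1
console.log(nextSequence('{"sequence": 1, "version": "1"}')); // 2
```

&lt;p&gt;Any output that cannot be read as a committed definition is treated as "not deployed yet", which matches the behaviour after a full reset.&lt;/p&gt;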
&lt;h3&gt;
  
  
  Deploying to Both Orgs
&lt;/h3&gt;

&lt;p&gt;The original chaincode.sh installed and approved chaincode only for org1. This is a problem.&lt;/p&gt;

&lt;p&gt;The reason goes back to Fabric’s endorsement policy. When a client submits a transaction to the ledger, it doesn’t go directly to one peer. It goes to multiple peers for &lt;em&gt;endorsement&lt;/em&gt; first. Each endorsing peer executes the chaincode, signs the result, and sends it back. The client collects enough endorsements to satisfy the policy, then sends the endorsed transaction to the orderers, which order it into a block that the peers then validate and commit.&lt;/p&gt;

&lt;p&gt;If the endorsement policy requires signatures from both org1 and org2 — which is the correct setup for a two-party booking system — then org2 peers must also have the chaincode installed and approved. Otherwise, they can’t endorse transactions, and the policy is never satisfied.&lt;/p&gt;

&lt;p&gt;I knew this from reading the architecture documentation before writing a line of deployment code. The fix was to loop over both orgs everywhere:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;function &lt;/span&gt;install_chaincode&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;cc_package&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;org &lt;span class="k"&gt;in &lt;/span&gt;org1 org2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;install_chaincode_for &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;org&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; peer1 &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cc_package&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
    install_chaincode_for &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;org&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; peer2 &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cc_package&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;function &lt;/span&gt;approve_chaincode&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;cc_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;cc_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;next_seq_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$3&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;org &lt;span class="k"&gt;in &lt;/span&gt;org1 org2&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;export_peer_context &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;org&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; peer1
    peer lifecycle chaincode approveformyorg &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--channelID&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CHANNEL_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cc_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--version&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--package-id&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cc_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--sequence&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;next_seq_num&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      ...
  &lt;span class="k"&gt;done&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And for the CCaaS launches, org2 also needs its own chaincode sidecar pods. This required a separate org2-cc-template.yaml with the same structure as org1-cc-template.yaml but with org2 substituted throughout. Four CCaaS pods in total: org1peer1-ccaas-chaincode, org1peer2-ccaas-chaincode, org2peer1-ccaas-chaincode, org2peer2-ccaas-chaincode.&lt;/p&gt;
&lt;h3&gt;
  
  
  The REST Interface: Bridging the Fabric SDK and HTTP
&lt;/h3&gt;

&lt;p&gt;Fabric doesn’t expose an HTTP API. The Fabric Node SDK communicates with peers over gRPC directly. To let the application backends interact with the blockchain over HTTP, there’s a separate service (the network REST interface) that wraps the SDK in an Express server.&lt;/p&gt;

&lt;p&gt;I forked fabric-samples/asset-transfer-basic/rest-api-typescript, which had the right structure already. It manages two long-lived gRPC connections to the network (one authenticated as an org1 identity, one as an org2 identity) and keeps them open for the life of the server. Creating new connections per request is expensive and the wrong pattern with Fabric's SDK.&lt;/p&gt;

&lt;p&gt;The key design decision in the original sample that I kept was the async job queue. Submitting a transaction to Fabric is not instant. The request goes to a peer for endorsement, then to the orderers for ordering, then to peers for validation and commit. This can take a few seconds. A naive synchronous REST endpoint would hold the HTTP request open for that whole time and risk timing out.&lt;/p&gt;

&lt;p&gt;The solution is to return 202 Accepted immediately with a job ID, and queue the transaction for background processing. The caller polls /api/jobs/:jobId to find out when it's done:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /api/bookings
→ 202 Accepted, { jobId: "42" }

GET /api/jobs/42
→ { status: "complete", transactionId: "abc123..." }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The queue is implemented with BullMQ, which uses Redis as a backend. Each submitted transaction is a job in the queue. A worker process picks jobs off the queue, submits them to Fabric, and writes the result back to Redis. The job status endpoint reads from Redis.&lt;/p&gt;
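
&lt;p&gt;To make the flow concrete without pulling in BullMQ or Redis, here is a minimal in-memory sketch of the same three roles: enqueue (the 202 path), a worker, and a status lookup. All names below (enqueue, processNext, jobStatus, fakeSubmit) are hypothetical, and a plain object stands in for Redis, so jobs would not survive a restart the way they do in the real service.&lt;/p&gt;

```typescript
type Job = { status: "waiting" | "complete" | "failed"; transactionId?: string; error?: string };

// Hypothetical stand-in for a Fabric submit call; resolves to a transaction ID.
async function fakeSubmit(payload: string) {
  return `tx-for-${payload}`;
}
type SubmitFn = typeof fakeSubmit;

const jobs: { [jobId: string]: Job } = {};
const pending: { jobId: string; payload: string }[] = [];
let nextJobId = 1;

// POST handler body: record the job and answer immediately with its ID.
function enqueue(payload: string): string {
  const jobId = String(nextJobId++);
  jobs[jobId] = { status: "waiting" };
  pending.push({ jobId, payload });
  return jobId;
}

// Worker: takes one job off the queue, submits it, records the outcome.
async function processNext(submit: SubmitFn) {
  const item = pending.shift();
  if (item === undefined) {
    return;
  }
  try {
    const transactionId = await submit(item.payload);
    jobs[item.jobId] = { status: "complete", transactionId };
  } catch (err) {
    jobs[item.jobId] = { status: "failed", error: String(err) };
  }
}

// GET /api/jobs/:jobId handler body: undefined means "no such job".
function jobStatus(jobId: string): Job | undefined {
  return jobs[jobId];
}
```

&lt;p&gt;A caller would invoke enqueue, receive the job ID straight away, and keep calling jobStatus until the status leaves "waiting".&lt;/p&gt;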

&lt;p&gt;The booking router handles authentication via API keys mapped to org identities. An API key for org1 tells the server to use the org1 connection profile and sign transactions with the org1 admin certificate. The key is passed as an X-Api-Key header. Both the customer backend and the travel agency backend have their own API keys.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// From auth.ts&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;apiKeyOrgs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ORG1_APIKEY&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Org1MSP&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ORG2_APIKEY&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Org2MSP&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The endpoints the booking router exposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GET /api/bookings - Evaluate GetAllBookings chaincode function&lt;/li&gt;
&lt;li&gt;GET /api/bookings/:bookingID - Evaluate ReadBooking&lt;/li&gt;
&lt;li&gt;POST /api/bookings - Submit RecordBooking (queued)&lt;/li&gt;
&lt;li&gt;DELETE /api/bookings/:bookingID - Submit DeleteBooking (queued)&lt;/li&gt;
&lt;li&gt;GET /api/bookings/:hyperledgerTxId/blockheight - Query block position&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Reading the Blockchain: Block Height and QSCC
&lt;/h3&gt;

&lt;p&gt;The assignment required that bookings be verifiable on the blockchain. The simplest form of verification is proving not just that a booking exists in the world state, but that it was committed in a specific block that has subsequent blocks built on top of it.&lt;/p&gt;

&lt;p&gt;Fabric has a system chaincode called QSCC — Query System Chaincode. It’s a built-in chaincode that runs on every peer and exposes ledger metadata. You can query it to find which block contains a given transaction, or to get the current height of the chain.&lt;/p&gt;

&lt;p&gt;The block height endpoint works like this: take the hyperledgerTxId stored in the booking, call QSCC's GetBlockByTxID function on the peer, and decode the returned protobuf to find the block number. Then call GetChainInfo to find the current chain height. Block numbers start at 0, so the newest block is number height - 1, and height - 1 - blockNumber tells you how many blocks have been added since this booking was committed.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;bookingsRouter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/:hyperledgerTxId/blockheight&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contract&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;locals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;mspId&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;qsccContract&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;Contract&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Ask QSCC: which block contains this transaction?&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blockBytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;contract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluateTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GetBlockByTxID&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mychannel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// QSCC returns raw protobuf bytes - decode with fabric-protos&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;common&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blockBytes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;blockHeight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;header&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// Get current chain height&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chainInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;common&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;BlockchainInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;contract&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluateTransaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;GetChainInfo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mychannel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;currentHeight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chainInfo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;OK&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;hyperledgerTxId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;blockHeight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Block where this booking was committed&lt;/span&gt;
    &lt;span class="na"&gt;blockchainHeight&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentHeight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Current chain height&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The protobuf decoding uses fabric-protos, a package that contains the compiled protobuf definitions for all Fabric message types. common.Block.decode(blockBytes) takes the raw bytes from QSCC and gives you a structured object with header.number as the block index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sgj6a9d5zus8sp8zrki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sgj6a9d5zus8sp8zrki.png" width="800" height="366"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Webapp (Fabric REST interface) showing hyperledgerTxId, blockHeight: 5, blockchainHeight: 7. Blocks are numbered from 0, so the newest block is number 6, meaning 1 block has been added since this booking&lt;/em&gt;&lt;/p&gt;
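
&lt;p&gt;The arithmetic behind those two numbers is worth spelling out, because it is easy to be off by one. Fabric's chain height is a count of blocks, while block numbers start at 0, so the newest block is always number height - 1. The helper below is a hypothetical sketch of that calculation; blocksSince is not part of the REST interface.&lt;/p&gt;

```typescript
// Number of blocks appended after the block that holds a transaction.
function blocksSince(blockNumber: number, chainHeight: number): number {
  if (blockNumber >= chainHeight) {
    throw new RangeError("block number must be below the chain height");
  }
  // The newest block is number chainHeight - 1.
  return chainHeight - 1 - blockNumber;
}

// The endpoint returns both values as strings, so convert before subtracting.
console.log(blocksSince(Number("5"), Number("7"))); // 1
```

&lt;p&gt;A result of 0 means the booking sits in the newest block and nothing has been built on top of it yet.&lt;/p&gt;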
&lt;h3&gt;
  
  
  The Reset Script and 30 Hours of Debugging
&lt;/h3&gt;

&lt;p&gt;Every configuration change, every chaincode update, every time something was broken beyond quick repair — the reset script. It tears down everything and rebuilds:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./network down &lt;span class="c"&gt;# Bring down peers, orderers, chaincode&lt;/span&gt;
./network unkind &lt;span class="c"&gt;# Delete the KIND cluster&lt;/span&gt;
./network kind &lt;span class="c"&gt;# Create a new KIND cluster&lt;/span&gt;
./network cluster init &lt;span class="c"&gt;# Install cert-manager, nginx, set up namespaces&lt;/span&gt;
./network up &lt;span class="c"&gt;# Launch CAs, peers, orderers&lt;/span&gt;
./network channel create &lt;span class="c"&gt;# Create mychannel, join peers&lt;/span&gt;
./network chaincode deploy chaincode chaincode/ &lt;span class="c"&gt;# Full chaincode lifecycle&lt;/span&gt;
./network rest-easy &lt;span class="c"&gt;# Build and deploy the REST interface&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; test-network port-forward svc/fabric-rest-sample 3003:3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;From scratch, this takes about fifteen minutes. I ran it a lot.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;yuvraj@Windows-11:~/mytravel/hyperledger$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./reset
&lt;span class="go"&gt;Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Shutting down test network "test-network":
✅ - Stopping Fabric services ...
✅ - Scrubbing Fabric volumes ...
✅ - Deleting namespace "test-network" ...
🏁 - Fabric network is down.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Deleting KIND cluster "kind":
✅ - Deleting KIND cluster kind ...
✅ - Deleting container registry "kind-registry" at localhost:5000 ...
🏁 - KIND Cluster is gone.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Creating KIND cluster "kind":
✅ - Creating cluster "kind" ...
✅ - Launching container registry "kind-registry" at localhost:5000 ...
🏁 - KIND cluster is ready
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Initializing K8s cluster
✅ - Launching kind ingress controller ...
✅ - Launching cert-manager ...
✅ - Waiting for cert-manager ...
✅ - Waiting for ingress controller ...
🏁 - Cluster is ready
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Launching network "test-network":
✅ - Creating namespace "test-network" ...
✅ - Provisioning volume storage ...
✅ - Creating fabric config maps ...
✅ - Initializing TLS certificate Issuers ...
✅ - Launching Fabric CAs ...
✅ - Enrolling bootstrap ECert CA users ...
✅ - Creating local node MSP ...
✅ - Launching orderers ...
✅ - Launching peers ...
🏁 - Network is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Creating channel "mychannel":
✅ - Registering org Admin users ...
✅ - Enrolling org Admin users ...
✅ - Creating channel MSP ...
✅ - Creating channel genesis block ...
✅ - Joining orderers to channel mychannel ...
✅ - Joining org1 peers to channel mychannel ...
✅ - Joining org2 peers to channel mychannel ...
🏁 - Channel is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Deploying chaincode
✅ - Building chaincode image chaincode ...
✅ - Publishing chaincode image localhost:5000/chaincode ...
✅ - Packaging ccaas chaincode chaincode ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Installing chaincode for org org1 peer peer1 ...
✅ - Installing chaincode for org org1 peer peer2 ...
✅ - Installing chaincode for org org2 peer peer1 ...
✅ - Installing chaincode for org org2 peer peer2 ...
✅ - Approving chaincode chaincode with ID chaincode:105d1916755525d103749c9d6245f1553cd7dc6b10be036d4cd574b050f99bf1 for org1 ...
✅ - Approving chaincode chaincode with ID chaincode:105d1916755525d103749c9d6245f1553cd7dc6b10be036d4cd574b050f99bf1 for org2 ...
✅ - Committing chaincode chaincode ...
🏁 - Chaincode is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
&lt;/span&gt;&lt;span class="gp"&gt;2026-05-01 14:59:49.508 UTC 0001 INFO [chaincodeCmd] chaincodeInvokeOrQuery -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Chaincode invoke successful. result: status:200 payload:&lt;span class="s2"&gt;"[]"&lt;/span&gt;
&lt;span class="go"&gt;Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Launching fabric-rest-sample application:
✅ - Constructing fabric-rest-sample connection profiles ...
✅ - Preparing the typescript REST interface ...
The fabric-rest-sample has started.
See https://github.com/hyperledger/fabric-samples/tree/main/asset-transfer-basic/rest-api-typescript for additional usage details.
To access the endpoint:
export SAMPLE_APIKEY=97834158-3224-4CE7-95F9-A148C886653E
&lt;/span&gt;&lt;span class="gp"&gt;curl -s --header "X-Api-Key: $&lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;SAMPLE_APIKEY&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;" http://fabric-rest-sample.localho.st/api/assets
&lt;/span&gt;&lt;span class="go"&gt;🏁 - Fabric REST sample is ready.
&lt;/span&gt;&lt;span class="gp"&gt;Forwarding from 127.0.0.1:3003 -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;3000
&lt;span class="gp"&gt;Forwarding from [::1]:3003 -&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Somewhere in rest_sample.sh there is this function:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight awk"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This magical awk script led to 30 hours of debugging a "TLS handshake error"&lt;/span&gt;
&lt;span class="c1"&gt;# moral: do not edit / alter the number of '\' in the following transform:&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;one_line_pem&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"`awk 'NF {sub(/\\n/, ""); printf "&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="err"&gt;\\\\\\&lt;/span&gt;&lt;span class="nx"&gt;n&lt;/span&gt;&lt;span class="s2"&gt;",$0;}' $1`"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This converts a multi-line PEM certificate file into a single-line string, which can be embedded in the JSON connection profile that the REST interface uses to connect to the peers. PEM files look like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-----BEGIN CERTIFICATE-----
MIICnTCCAkSgAwIBAgIUHqVnDpJd...
-----END CERTIFICATE-----
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The JSON connection profile needs the certificate as a single string with literal \n sequences instead of actual newlines. The awk script does that conversion. The number of backslashes in the printf format string is not a mistake - it's exactly what's needed to survive every layer of interpretation between the script and the final file (the outer shell's quoting and command substitution, awk's string parsing, and then the final JSON embedding).&lt;/p&gt;
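&lt;p&gt;For comparison, here is a minimal Python sketch of the same transform. It is not what rest_sample.sh does; the function names and the tlsCACerts/pem layout are assumptions for illustration. The second helper avoids backslash counting entirely by letting a JSON encoder do the escaping.&lt;/p&gt;

```python
import json


def one_line_pem(pem_text: str) -> str:
    """Join the non-empty lines of a PEM block with a literal backslash-n."""
    lines = [line.strip() for line in pem_text.splitlines() if line.strip()]
    # "\\n" is two characters: a backslash followed by the letter n.
    return "".join(line + "\\n" for line in lines)


def embed_pem(pem_text: str) -> str:
    """Serialize a (hypothetical) profile fragment; json.dumps escapes newlines."""
    return json.dumps({"tlsCACerts": {"pem": pem_text}})
```

&lt;p&gt;Either way, the useful check is a round trip: parse the generated profile and confirm that the certificate that comes out is character-for-character the PEM that went in.&lt;/p&gt;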

&lt;p&gt;I found it on Stack Overflow. The comment saying not to edit was already there. I ignored the comment. At some point while trying to understand what the function did, I adjusted the backslashes. The resulting connection profile looked syntactically fine (valid JSON, readable PEM string) but the embedded certificate was subtly malformed when parsed by the TLS library. The peers rejected connections with a generic TLS handshake error. Nothing in the error message pointed to the certificate content.&lt;/p&gt;

&lt;p&gt;Thirty hours later I found the diff, restored the original function, and the network came back up immediately.&lt;/p&gt;

&lt;p&gt;The lesson I took from this is specific: when you copy a piece of code that works and the original author has left a warning comment, take the comment more seriously than you take your own curiosity.&lt;/p&gt;
&lt;h3&gt;
  
  
  What It Feels Like to Develop with Fabric
&lt;/h3&gt;

&lt;p&gt;Hyperledger Fabric is not built for rapid iteration. The formal chaincode lifecycle (package, install, approve, commit) exists for legitimate reasons in a real multi-organization network, where independent organizations need to independently audit and approve changes to shared business logic before those changes take effect. In that context, the process is the point.&lt;/p&gt;

&lt;p&gt;In a student project with one developer and a fifteen-minute reset cycle, the friction is harder to appreciate. But some of the design choices still made genuine sense even at this scale.&lt;/p&gt;

&lt;p&gt;The privacy model was the clearest one. Booking records on a distributed ledger are readable by every peer operator in the network. Storing userName: "Alice" and userEmail: "&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;" directly in those records was obviously wrong. The user hash approach - borrow the idea from Bitcoin, keep personal data in the application layer, put only an opaque identifier on the chain - is the correct design regardless of whether you're building a student project or a production system.&lt;/p&gt;
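&lt;p&gt;A minimal sketch of that idea in Python, under stated assumptions: the keyed HMAC-SHA256, the field names and the helper names below are illustrative choices, not the project's actual code. The point is only that the on-chain record carries an opaque identifier while the name and email stay in the application database.&lt;/p&gt;

```python
import hashlib
import hmac


def user_hash(user_id: str, secret: bytes) -> str:
    """Derive an opaque, stable identifier for a user.

    Assumption: a keyed HMAC is used so the identifier cannot be recomputed
    from a guessed email or user ID without the application's secret.
    """
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def ledger_record(booking_id: str, user_id: str, secret: bytes) -> dict:
    """Build what goes on chain: no name, no email, only the opaque hash."""
    return {"bookingId": booking_id, "userHash": user_hash(user_id, secret)}
```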

&lt;p&gt;The block height endpoint also felt worth building properly. Returning a booking record from the world state proves the booking exists &lt;em&gt;now&lt;/em&gt;. Returning the block number and the current chain height proves when it was committed and that the chain has grown since then, making the record progressively harder to retroactively alter. That’s what blockchain verification actually means, and it’s different from just having a database record.&lt;/p&gt;
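&lt;p&gt;The arithmetic behind that claim is small enough to sketch. The function name is hypothetical, and the numbering convention is an assumption: block numbers start at 0 and the chain height is the total number of blocks.&lt;/p&gt;

```python
def blocks_on_top(block_number: int, chain_height: int) -> int:
    """Count blocks committed after the one that holds the booking.

    Assumption: the newest block is numbered chain_height - 1.
    """
    if block_number >= chain_height or 0 > block_number:
        raise ValueError("block number is outside the current chain")
    return chain_height - 1 - block_number
```

&lt;p&gt;A booking committed in block 5 of a chain whose height is 10 has four blocks stacked on top of it; every new block raises that number.&lt;/p&gt;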

&lt;p&gt;The rest of it (the endorsement policy, the CA infrastructure, the Raft consensus cluster) was mostly infrastructure I set up correctly and then tried not to touch. Which, given the awk script experience, is probably the right approach.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/YuvrajRaghuvanshiS" rel="noopener noreferrer"&gt;
        YuvrajRaghuvanshiS
      &lt;/a&gt; / &lt;a href="https://github.com/YuvrajRaghuvanshiS/mytravel" rel="noopener noreferrer"&gt;
        mytravel
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Single monorepo for the MyTravel project
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;MyTravel.com - Blockchain-Based Ticket Booking Platform&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;MyTravel.com is a comprehensive ticket booking system that leverages blockchain technology to ensure secure, transparent, and immutable transaction records. This platform integrates traditional web application architecture with Hyperledger Fabric blockchain infrastructure, providing a hybrid web2-web3 solution for customers and travel agencies. The system enables real-time booking management, dynamic pricing, and decentralized transaction verification while maintaining user privacy through anonymized blockchain interactions.&lt;/p&gt;




&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Project Architecture Overview&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;The platform follows a modular microservices architecture with four core components:&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;1. Customer Backend Service (Node.js/Express)&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;Handles customer-facing operations including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User registration and JWT-based authentication&lt;/li&gt;
&lt;li&gt;Travel listing discovery and filtering&lt;/li&gt;
&lt;li&gt;Booking management with blockchain verification&lt;/li&gt;
&lt;li&gt;Digital wallet operations&lt;/li&gt;
&lt;li&gt;Profile management with anonymous mode support&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;2. Travel Agency Backend Service (Node.js/Express)&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;Manages agency-specific functionalities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agency registration and authentication&lt;/li&gt;
&lt;li&gt;Travel route creation/updation&lt;/li&gt;
&lt;li&gt;Seat inventory management&lt;/li&gt;
&lt;li&gt;Booking reconciliation&lt;/li&gt;
&lt;li&gt;Financial settlements&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;3. React Frontend Application&lt;/h3&gt;

&lt;/div&gt;

&lt;p&gt;Provides unified user interface for:&lt;/p&gt;


&lt;ul&gt;

&lt;li&gt;Customer booking flow&lt;/li&gt;

&lt;li&gt;Agency…&lt;/li&gt;

&lt;/ul&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/YuvrajRaghuvanshiS/mytravel" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;&lt;em&gt;This article is rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hyperledgerfabric</category>
      <category>web3</category>
      <category>blockchain</category>
      <category>backend</category>
    </item>
    <item>
      <title>I Told My Friend I’d Hack WhatsApp. Then I Actually Did It.</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Sun, 03 May 2026 12:25:15 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/i-told-my-friend-id-hack-whatsapp-then-i-actually-did-it-4om7</link>
      <guid>https://dev.to/yuvrajraghuvanshis/i-told-my-friend-id-hack-whatsapp-then-i-actually-did-it-4om7</guid>
      <description>&lt;p&gt;A friend asked me if I could “hack” WhatsApp. I was bored, so I said yes.&lt;/p&gt;

&lt;p&gt;I didn’t break any encryption. I didn’t exploit a server. What I found was something more interesting: a door that was left open inside Android itself, and an old version of WhatsApp that was still politely holding it open years after everyone else had moved on.&lt;/p&gt;

&lt;p&gt;The result was a Python tool that extracts WhatsApp’s encryption key and message database from a phone (without root access) by exploiting the Android backup system. It picked up 540 stars on GitHub. It also got me a message from a developer trying to help people migrate from WhatsApp to Signal right before WhatsApp changed its terms of service to share data with Facebook. That part felt like it actually mattered.&lt;/p&gt;

&lt;p&gt;This is how it works, from the ground up.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Android Stores App Data
&lt;/h3&gt;

&lt;p&gt;To understand the trick, you first need to understand how Android isolates apps from each other.&lt;/p&gt;

&lt;p&gt;Every Android app runs in its own sandbox. This is not metaphorical — it’s enforced at the Linux kernel level. When an app is installed, Android creates a dedicated Linux user for it. WhatsApp might be u0_a123. Your banking app might be u0_a201. These are real Unix users with real UIDs. Because they're different users, they cannot read each other's files. The file permissions are enforced by the kernel, the same way they are on any Linux system.&lt;/p&gt;

&lt;p&gt;An app’s private data lives in a directory under /data/data/ named after its package. For WhatsApp, that's /data/data/com.whatsapp. Inside it, you'll find subdirectories that look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/data/data/com.whatsapp/
    databases/
        msgstore.db ← your messages (plaintext SQLite)
        wa.db ← contacts
    files/
        key ← the encryption key
    shared_prefs/
    cache/
    lib/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;/data itself is on a partition that is not accessible to normal users or normal processes. On a rooted phone, you can adb shell in and read it directly because root bypasses the permission system. On a non-rooted phone, you cannot. The files are there (they're on the storage) but the kernel will refuse every read attempt from any process that isn't WhatsApp itself.&lt;/p&gt;

&lt;p&gt;This is the wall that everyone trying to extract WhatsApp data runs into.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjol3u4nxrcg6l8s7tfsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjol3u4nxrcg6l8s7tfsi.png" width="711" height="264"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: adb shell showing “Permission denied” when trying to ls /data/data/com.whatsapp without root&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Key and the Database
&lt;/h3&gt;

&lt;p&gt;Before going further, it’s worth explaining what these two files actually are and why you need both of them.&lt;/p&gt;

&lt;p&gt;msgstore.db inside the sandbox at /data/data/com.whatsapp/databases/ is a plain, unencrypted SQLite database. WhatsApp works with it directly - reading and writing your messages in cleartext while the app is running. SQLite is a well-understood format; tools to read it are everywhere.&lt;/p&gt;

&lt;p&gt;What you see if you browse your phone’s SD card storage is different: msgstore.db.crypt14, stored at /sdcard/WhatsApp/Databases/. This is the encrypted &lt;em&gt;backup&lt;/em&gt; copy that WhatsApp writes to external storage so Google Drive can sync it. The number in the extension (crypt12, crypt14) indicates the encryption scheme version. This external copy is encrypted with AES-256 and is unreadable without the key.&lt;/p&gt;

&lt;p&gt;The key file lives at /data/data/com.whatsapp/files/key. It's a binary file containing key material that WhatsApp generates based on private factors specific to your account and device. WhatsApp doesn't document the generation process publicly - we only know it exists because the file exists, and because without it, the crypt14 file is unreadable garbage.&lt;/p&gt;

&lt;p&gt;Here’s what makes the key particularly valuable: it isn't rotated. Once you have it, you can use it to decrypt any crypt14 backup from that account (past ones, current ones, and future ones) until WhatsApp generates a new key. This is different from, say, TLS session keys that are discarded after use. The same key file that works today will work on a database backup taken six months ago, and on one taken six months from now.&lt;/p&gt;

&lt;p&gt;The problem is getting the key. It lives inside the sandbox. Root gives you a sledgehammer to break through it. But there’s a quieter way.&lt;/p&gt;
&lt;h3&gt;
  
  
  What adb backup Is
&lt;/h3&gt;

&lt;p&gt;Android Debug Bridge (adb) is a command-line tool included in the Android SDK. It lets a connected computer communicate with an Android device for development purposes: install APKs, run shell commands, read logs, transfer files. When you enable USB Debugging on your phone, you're enabling adb.&lt;/p&gt;

&lt;p&gt;One of the things adb could do, from Android 4.0 onwards, was create full app backups:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adb backup &lt;span class="nt"&gt;-f&lt;/span&gt; myApp.ab &lt;span class="nt"&gt;-apk&lt;/span&gt; com.foobar.app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would back up the app’s APK and its private data directory to a file on your computer. The backup system was designed for legitimate use: before switching phones, before a factory reset. It needed to read files from /data/data/ (which normally only the app itself can do) so it ran with elevated OS-level privileges to do so.&lt;/p&gt;

&lt;p&gt;This is the door. The backup system could read app sandboxes. The question was whether it was allowed to for any given app.&lt;/p&gt;

&lt;h3&gt;
  
  
  AndroidManifest.xml and the allowBackup Flag
&lt;/h3&gt;

&lt;p&gt;Every Android app ships with a file called AndroidManifest.xml. This is the app's declaration to the operating system - it lists the app's package name, the permissions it needs, the activities it contains, and dozens of other properties. Android reads this file when installing an app and uses it to configure how the app is treated by the system.&lt;/p&gt;

&lt;p&gt;One of those properties is android:allowBackup. It controls whether adb backup is permitted to include this app's data in a backup. The default value, historically, was true - if you didn't specify it, backups were allowed.&lt;/p&gt;

&lt;p&gt;When WhatsApp realised that this flag meant anyone with USB debugging access could extract their users’ messages with a single command, they set it to false in their manifest. After that point, running adb backup com.whatsapp would produce a backup file, but it would be empty. The backup system would see allowBackup="false" in the manifest and skip the data entirely.&lt;/p&gt;

&lt;p&gt;Here is what that flag looks like inside a decompiled WhatsApp APK manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;manifest&lt;/span&gt; &lt;span class="na"&gt;xmlns:android=&lt;/span&gt;&lt;span class="s"&gt;"http://schemas.android.com/apk/res/android"&lt;/span&gt;
    &lt;span class="na"&gt;package=&lt;/span&gt;&lt;span class="s"&gt;"com.whatsapp"&lt;/span&gt;
    &lt;span class="na"&gt;android:versionCode=&lt;/span&gt;&lt;span class="s"&gt;"452658"&lt;/span&gt;
    &lt;span class="na"&gt;android:versionName=&lt;/span&gt;&lt;span class="s"&gt;"2.21.1.1"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;application&lt;/span&gt;
        &lt;span class="na"&gt;android:label=&lt;/span&gt;&lt;span class="s"&gt;"@string/app_name"&lt;/span&gt;
        &lt;span class="na"&gt;android:icon=&lt;/span&gt;&lt;span class="s"&gt;"@drawable/ic_launcher"&lt;/span&gt;
        &lt;span class="na"&gt;android:allowBackup=&lt;/span&gt;&lt;span class="s"&gt;"false"&lt;/span&gt; &lt;span class="err"&gt;←&lt;/span&gt; &lt;span class="err"&gt;current&lt;/span&gt; &lt;span class="err"&gt;versions:&lt;/span&gt; &lt;span class="err"&gt;explicitly&lt;/span&gt; &lt;span class="err"&gt;denied&lt;/span&gt;
        &lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modern WhatsApp has allowBackup="false". But old versions of WhatsApp (specifically versions from before WhatsApp realised this was a problem) had allowBackup="true", or simply didn't specify it at all, defaulting to permitted.&lt;/p&gt;
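&lt;p&gt;The decision the backup system makes can be sketched in a few lines of Python. This assumes a manifest that has already been decoded to text (the manifest inside an APK is binary XML), and the function name is hypothetical. A missing attribute is treated as permitted, matching the historical default described above.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"


def allows_backup(manifest_xml: str) -> bool:
    """Report whether a decoded manifest permits adb backup of app data."""
    root = ET.fromstring(manifest_xml)
    app = root.find("application")
    if app is None:
        # No application element to carry the flag; fall back to the default.
        return True
    # ElementTree addresses namespaced attributes as "{namespace}name".
    value = app.get("{" + ANDROID_NS + "}allowBackup", "true")
    return value.strip().lower() == "true"
```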

&lt;p&gt;The version used in this tool is &lt;strong&gt;v2.11.431&lt;/strong&gt;, from around 2015. At that version, the manifest permits backups. The Android backup system doesn't know it's running an old version. It reads the manifest, sees the flag, and opens the data directory.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trick: Uninstall (Keeping Data), Backup, Restore
&lt;/h3&gt;

&lt;p&gt;The loophole requires four steps: uninstall current WhatsApp while keeping its data intact, install the old version that allows backups, run the backup, then reinstall current WhatsApp.&lt;/p&gt;

&lt;p&gt;That first step (“uninstall while keeping data”) deserves explanation, because it’s not what most people think of when they hear “uninstall.”&lt;/p&gt;

&lt;p&gt;Android normally does two things when you uninstall an app: it removes the APK, and it wipes the app's directory under /data/data/. Your messages, settings, everything - gone. This is the standard uninstall. But adb exposes a flag that separates these two operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adb shell pm uninstall &lt;span class="nt"&gt;-k&lt;/span&gt; com.whatsapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The -k flag means "keep data." The APK is removed. The data directory at /data/data/com.whatsapp is left completely untouched - key, databases, shared preferences, all of it still there, owned by a UID that no longer has an app attached to it.&lt;/p&gt;

&lt;p&gt;This is the correct way to do the downgrade. Android does not allow installing an older version of an app directly over a newer one — it checks version codes and refuses with INSTALL_FAILED_VERSION_DOWNGRADE. In-place downgrade is blocked. But if the app is already uninstalled (even with -k), there's nothing to compare version codes against. The legacy APK installs cleanly. Then when Legacy WhatsApp starts, it finds the existing data directory (the one that current WhatsApp left behind) and picks up from exactly where it was. The key file is there. The databases are there. Legacy WhatsApp doesn't care that the data was written by a newer version.&lt;/p&gt;

&lt;p&gt;So the sequence is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Current WhatsApp is installed on the device. Its data (key, msgstore.db) is in /data/data/com.whatsapp.&lt;/li&gt;
&lt;li&gt;Uninstall current WhatsApp with adb shell pm uninstall -k com.whatsapp. The APK is gone; the data directory survives.&lt;/li&gt;
&lt;li&gt;Install Legacy WhatsApp v2.11.431 via adb install. It finds the existing data directory and inherits it.&lt;/li&gt;
&lt;li&gt;Run adb backup com.whatsapp. Legacy WhatsApp's manifest says allowBackup="true", so the backup system reads the full data directory and writes it to whatsapp.ab on the computer. This includes the key file and the plaintext msgstore.db.&lt;/li&gt;
&lt;li&gt;Uninstall Legacy WhatsApp. Reinstall current WhatsApp. The data directory is still there. WhatsApp opens normally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The user ends up with their current WhatsApp running unchanged, and the computer has a copy of the key and msgstore.db.&lt;/p&gt;
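&lt;p&gt;The sequence above can be sketched as a list of adb invocations. This is an outline rather than the tool's actual code: the APK paths and backup file name are placeholders, the helper names are hypothetical, and using -k again in the last step is an assumption made so the data directory survives the second uninstall.&lt;/p&gt;

```python
import subprocess

PACKAGE = "com.whatsapp"


def extraction_commands(legacy_apk: str, current_apk: str, backup_file: str):
    """Return the adb invocations for steps 2 to 5, in order."""
    uninstall_keep_data = ["adb", "shell", "pm", "uninstall", "-k", PACKAGE]
    return [
        uninstall_keep_data,                             # step 2: APK gone, data kept
        ["adb", "install", legacy_apk],                  # step 3: legacy version inherits the data
        ["adb", "backup", "-f", backup_file, PACKAGE],   # step 4: needs confirmation on the phone
        uninstall_keep_data,                             # step 5a: assumption, -k used again
        ["adb", "install", current_apk],                 # step 5b: back to the current version
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)
```

&lt;p&gt;Keeping command construction separate from execution means the plan can be printed and reviewed before anything touches the device.&lt;/p&gt;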

&lt;h3&gt;
  
  
  What Happens on the Phone Screen During Backup
&lt;/h3&gt;

&lt;p&gt;When adb backup is triggered, Android shows a system prompt on the phone that the user must explicitly interact with. This is a security measure - silent background backups are not allowed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn2y1pdk8r1q38fpe6x7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgn2y1pdk8r1q38fpe6x7.jpeg" width="278" height="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: the “Full backup” system dialog, showing the password field and the “Back up my data” button&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The dialog says: “A full backup of all data to a connected desktop computer has been requested. Do you want to allow this to happen? If you did not request the backup yourself, do not allow the operation to proceed.”&lt;/p&gt;

&lt;p&gt;The user taps “Back up my data.” The backup begins.&lt;/p&gt;

&lt;p&gt;There’s also a password field. If the user enters a password here, the backup archive is encrypted with AES before it’s written to disk. The script asks the user on the terminal whether they want to set a backup password. If they do, they need to type the same password into both places — the terminal (so the script knows it for decryption later) and the phone screen (so Android encrypts with it). If the passwords don’t match, the extracted backup will be unreadable.&lt;/p&gt;

&lt;p&gt;This is also why one of the troubleshooting notes says: “If you have set a &lt;em&gt;default&lt;/em&gt; backup password in your Android settings, then this MUST be the backup password that you PROVIDE when prompted.” Some users have a system-wide default backup password set in their Android settings that they’ve forgotten about. Android silently uses it to encrypt the backup. The script then can’t decrypt it because it received a different password.&lt;/p&gt;
&lt;h3&gt;
  
  
  The .ab File Format
&lt;/h3&gt;

&lt;p&gt;adb backup produces a .ab file - Android Backup. This is not a zip or any other standard archive format, but because Android is open source, its structure is known.&lt;/p&gt;

&lt;p&gt;The file begins with a plaintext header:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ANDROID BACKUP
4
1
none
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These four lines tell you: this is an Android backup file, format version 4, compressed (1 = yes), encryption algorithm (none, or AES-256 if a password was set). After the header, the rest of the file is a zlib-compressed tar archive. Inside the tar is the app’s data directory, with paths like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apps/com.whatsapp/f/key
apps/com.whatsapp/db/msgstore.db
apps/com.whatsapp/db/wa.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The f/ prefix corresponds to the files/ directory in the app's data, and db/ to databases/. The tar contains the full directory structure, just with these path prefixes.&lt;/p&gt;
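&lt;p&gt;For the unencrypted case, the format described above is simple enough to sketch with Python's standard library. The function names are hypothetical, and the sketch deliberately refuses password-protected backups rather than attempting the AES path.&lt;/p&gt;

```python
import io
import tarfile
import zlib


def parse_ab_header(blob: bytes):
    """Split an .ab blob into its four header lines and the remaining payload."""
    magic, version, compressed, encryption, payload = blob.split(b"\n", 4)
    if magic != b"ANDROID BACKUP":
        raise ValueError("not an Android backup file")
    header = {
        "version": int(version),
        "compressed": compressed == b"1",
        "encryption": encryption.decode("ascii"),
    }
    return header, payload


def unpack_unencrypted(blob: bytes) -> bytes:
    """Return the raw tar bytes from an unencrypted backup."""
    header, payload = parse_ab_header(blob)
    if header["encryption"] != "none":
        raise ValueError("encrypted backups are out of scope for this sketch")
    return zlib.decompress(payload) if header["compressed"] else payload


def list_members(tar_bytes: bytes):
    """List the paths stored in the tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
        return archive.getnames()
```

&lt;p&gt;This reads the whole file into memory, which is fine for a sketch but not for a multi-gigabyte backup; a real implementation would stream the decompression.&lt;/p&gt;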

&lt;p&gt;To unpack this, the tool uses &lt;strong&gt;android-backup-extractor&lt;/strong&gt;, a small open source Java utility that understands the .ab header, handles the optional AES decryption, decompresses the zlib payload, and extracts the tar. I did try to rewrite this part in Python. The unencrypted case is straightforward - strip the header, decompress with zlib, untar. But when a password is involved, the AES key derivation that Android uses involves specific parameters and byte-level handling that I couldn't get right. Rather than ship a broken decryptor, I kept the Java dependency. The android-backup-extractor handles it correctly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;java &lt;span class="nt"&gt;-jar&lt;/span&gt; abe.jar unpack whatsapp.ab whatsapp.tar &lt;span class="o"&gt;[&lt;/span&gt;password]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, the tar is standard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;tar&lt;/span&gt; &lt;span class="nt"&gt;-xf&lt;/span&gt; whatsapp.tar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the key file and databases are on disk, readable.&lt;/p&gt;
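&lt;p&gt;Because the extracted msgstore.db is ordinary SQLite, a first look needs nothing beyond Python's standard library. The sketch below only lists table names; it makes no assumptions about WhatsApp's schema, and the function name is hypothetical.&lt;/p&gt;

```python
import sqlite3


def list_tables(db_path: str):
    """Return the table names in a SQLite file, opened read-only."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

&lt;p&gt;Opening the file with mode=ro is a deliberate choice: it keeps an exploratory script from modifying the only copy of the extracted database.&lt;/p&gt;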

&lt;h3&gt;
  
  
  Installing the Legacy Version: INSTALL_FAILED_VERSION_DOWNGRADE
&lt;/h3&gt;

&lt;p&gt;The first attempt most people make is to just install the legacy APK directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;adb &lt;span class="nb"&gt;install &lt;/span&gt;LegacyWhatsApp.apk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Android refuses. It checks the version code of the APK being installed against the already-installed app. If the incoming version code is lower, it rejects with INSTALL_FAILED_VERSION_DOWNGRADE. Downgrade over an existing installation is blocked.&lt;/p&gt;

&lt;p&gt;The correct approach is to uninstall current WhatsApp first (using the -k flag to preserve the data directory) and then install the legacy APK cleanly. With no existing installation to compare against, Android has no version code conflict to enforce.&lt;/p&gt;

&lt;p&gt;The --allow-reboot flag handles a related but separate problem. On some devices, even after uninstalling with -k, the installation fails for other reasons. Rebooting before the install clears whatever state was causing the refusal. The exact mechanism is in the "if it works, don't touch it" category - the device is rebooted via adb reboot before the legacy APK is installed, and on devices where it was failing, it stops failing. The most likely explanation is that some manufacturers' Android builds check additional conditions at runtime that aren't re-evaluated immediately after boot.&lt;/p&gt;

&lt;p&gt;There’s a related issue on MIUI (Xiaomi’s Android skin): adb install is blocked by a separate setting ("Install via USB") in Developer Options, distinct from USB Debugging. Without it, every install attempt fails with INSTALL_FAILED_USER_RESTRICTED, regardless of version codes.&lt;/p&gt;
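&lt;p&gt;Both failures arrive as an error code in adb's output, so a script can turn them into something actionable. A small sketch, with a hypothetical function name and hint wording of my own:&lt;/p&gt;

```python
HINTS = {
    "INSTALL_FAILED_VERSION_DOWNGRADE":
        "uninstall the current app with 'pm uninstall -k' first, then install",
    "INSTALL_FAILED_USER_RESTRICTED":
        "enable 'Install via USB' in Developer Options (MIUI)",
}


def explain_install_failure(adb_output: str):
    """Return a hint for the first known error code found, else None."""
    for code, hint in HINTS.items():
        if code in adb_output:
            return hint
    return None
```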

&lt;h3&gt;
  
  
  Running Legacy WhatsApp Before the Backup
&lt;/h3&gt;

&lt;p&gt;Issue #16 in the repository documents one of the more interesting device-specific behaviours encountered. On some devices, the backup would run without errors but produce a nearly empty archive — no key, no database.&lt;/p&gt;

&lt;p&gt;The cause turned out to be that Legacy WhatsApp hadn’t been launched even once before the backup was taken. On those devices, Android’s backup system only includes app data that has been “activated” — meaning the app has run at least once since installation. Without that first launch, the data directory exists (carried over from the current WhatsApp installation), but the backup hook reports nothing to back up.&lt;/p&gt;

&lt;p&gt;The fix was to launch Legacy WhatsApp briefly before triggering the backup. The script does this, with a prompt asking the user to open the app and let it sit for a few seconds before continuing. It’s the kind of fix that makes no sense until you see the behaviour it’s correcting.&lt;/p&gt;
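&lt;p&gt;Since this failure is silent, it helps to check the unpacked archive for the files that matter before declaring success. A sketch of such a check, using the archive paths shown earlier; the function name is hypothetical and this is not necessarily how the script verifies its output:&lt;/p&gt;

```python
EXPECTED = (
    "apps/com.whatsapp/f/key",
    "apps/com.whatsapp/db/msgstore.db",
)


def missing_from_backup(member_names):
    """Return the expected paths that are absent from the tar listing."""
    present = set(member_names)
    return [path for path in EXPECTED if path not in present]
```

&lt;p&gt;An empty result means both the key and the database made it into the backup; anything else is a cue to launch Legacy WhatsApp and try again.&lt;/p&gt;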

&lt;p&gt;Approximately 90% of the issues filed on this project follow the same pattern: something fails silently, the fix involves a step that has no obvious causal relationship with the problem, and once you add the step the issue disappears. Android’s backup system is not well-documented in its edge cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  ADB Over TCP: Using the Tool Without a USB Cable — and Seeing the Screen
&lt;/h3&gt;

&lt;p&gt;Normal adb communication happens over USB. The tool also supports ADB over TCP - connecting to a device over Wi-Fi instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 wa_kdbe.py --tcp-ip 192.168.43.130 --tcp-port 5555
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Android has had ADB-over-TCP support built in since early versions. Once enabled (either through Developer Options on some devices, or by first connecting via USB and running adb tcpip 5555), the device listens for ADB connections on port 5555 over the local network.&lt;/p&gt;
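&lt;p&gt;The USB-first route boils down to two adb commands. Here is a small sketch of how a script could build and run them; tcp_connect_commands and run_all are illustrative names, not functions from the tool, and adb is assumed to be on the PATH.&lt;/p&gt;

```python
import subprocess


def tcp_connect_commands(ip: str, port: int = 5555) -> list[list[str]]:
    """Build the adb commands that switch a device to TCP mode and connect to it."""
    return [
        ["adb", "tcpip", str(port)],         # run while still attached over USB
        ["adb", "connect", f"{ip}:{port}"],  # then connect over the network
    ]


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order, raising on the first one that fails."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

&lt;p&gt;Calling run_all(tcp_connect_commands("192.168.43.130")) would issue both commands in sequence; after that the USB cable can be unplugged.&lt;/p&gt;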

&lt;p&gt;The practical use cases: a broken USB port, a device on the other side of a room, or a device anywhere on the same network as the computer. The --tcp-ip flag accepts any IP address, including a phone connected via mobile hotspot. The tool works the same over TCP as it does over USB; the ADB protocol doesn't care about the transport layer.&lt;/p&gt;

&lt;p&gt;When running in TCP mode, there’s an additional flag: --scrcpy. This uses &lt;a href="https://github.com/Genymobile/scrcpy" rel="noopener noreferrer"&gt;scrcpy&lt;/a&gt; (an open source tool by Genymobile) to mirror the phone's screen to a window on the computer and allow touch input via mouse and keyboard. In USB mode this is less necessary since the device is physically in hand, but over TCP the phone might be in another room. With --scrcpy, the backup dialog that appears on the phone screen (the one asking the user to tap "Back up my data") can be interacted with directly from the computer. No need to walk over to the device.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 wa_kdbe.py --tcp-ip 192.168.43.130 --tcp-port 5555 --scrcpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both features were added because they could be added, and because removing the USB requirement while adding screen control made the tool work from genuinely anywhere on the same network.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Signal Developer Who Reached Out
&lt;/h3&gt;

&lt;p&gt;A few weeks after the project picked up traction, I received a message from a developer named Sam:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“I want to make it easy for people to switch from WhatsApp to Signal, which I think is important, especially given the upcoming changes to the TOS of WhatsApp [sharing all user data with Facebook from February 8]. I’ve forked the Signal Android App and added a WhatsApp import functionality, that migrates your existing WhatsApp threads from msgstore.db to Signal. However, the process of retrieving msgstore.db is way too complicated for many people and most people don’t want to root their phone.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He had built a Signal fork that could import WhatsApp’s msgstore.db directly. The blocking problem was that getting msgstore.db off a non-rooted phone was too technically involved for most users. He was asking if the tool could be adapted to make the extraction seamless enough for non-technical users.&lt;/p&gt;

&lt;p&gt;This is the part that reframes what the project actually was. What started as me being bored and proving something to a friend turned out to be infrastructure for something with a real privacy rationale: helping people leave a platform that had just announced it would share their data with one of the largest advertising companies in the world, and take their message history with them.&lt;/p&gt;

&lt;p&gt;The msgstore.db that the tool extracts (once decrypted with the key) is a standard SQLite database. Every message is in there. You can read it with any SQLite browser, write queries against it, import it into other applications. The data is yours. It exists on your device. The only thing standing between you and it was the sandbox, the allowBackup flag, and the question of which version of WhatsApp was currently installed.&lt;/p&gt;
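&lt;p&gt;Because it is plain SQLite, Python's built-in sqlite3 module is enough to start exploring. This sketch only lists the tables in a database file and deliberately assumes nothing about WhatsApp's schema; list_tables is a hypothetical helper name.&lt;/p&gt;

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

&lt;p&gt;Running list_tables on the decrypted msgstore.db gives a starting point for writing queries against whichever tables are actually present.&lt;/p&gt;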

&lt;h3&gt;
  
  
  Why It No Longer Works Reliably
&lt;/h3&gt;

&lt;p&gt;I have marked this tool as “NO LONGER MAINTAINED.” This needs context.&lt;/p&gt;

&lt;p&gt;The allowBackup loophole in WhatsApp v2.11.431 was never closed in the old APK itself, which still declares allowBackup="true". What changed is Android: it has progressively made downgrade installations harder and backup extraction more restricted.&lt;/p&gt;

&lt;p&gt;Newer Android versions (11, 12, 13+) have tightened the backup system significantly. Google restricted what adb backup could access, eventually deprecating the feature entirely in higher API levels. The backup dialog still appears, the process runs, but on many modern Android builds the data directories are excluded by the OS regardless of what the app manifest says. The flag that WhatsApp forgot to set stopped mattering because Android stopped honouring it.&lt;/p&gt;

&lt;p&gt;The tool works on older Android versions and some devices where the restrictions haven’t fully landed. On others (most new devices running Android 12 or 13) it will produce an empty or near-empty backup. The backup system is still there; it just doesn’t open the data directories anymore.&lt;/p&gt;
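&lt;p&gt;An empty or near-empty backup is easy to miss because nothing fails outright. One way to catch it, sketched here under the assumption that the backup has already been unpacked to a tar file, is to check whether the two files that matter are present anywhere in the archive. The missing_files helper is hypothetical, and it matches on file name only rather than assuming a particular directory layout.&lt;/p&gt;

```python
import tarfile
from pathlib import PurePosixPath

# The two files the extraction is ultimately after.
WANTED = {"key", "msgstore.db"}


def missing_files(tar_path: str) -> set[str]:
    """Return which of the wanted file names do not appear anywhere in the tar."""
    with tarfile.open(tar_path) as tar:
        present = {
            PurePosixPath(member.name).name
            for member in tar.getmembers()
            if member.isfile()
        }
    return WANTED - present
```

&lt;p&gt;An empty result means both files were found; anything else is a sign that the device excluded the data directories from the backup.&lt;/p&gt;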

&lt;p&gt;I moved on when it became clear that the percentage of devices where it worked reliably was shrinking with every Android release, and that fixing it would require a different approach entirely — one that likely involves ADB shell commands and run-as, which has its own restrictions and device-specific behaviour.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Tool Actually Does, End to End
&lt;/h3&gt;

&lt;p&gt;For completeness, here is the full sequence the script runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connect&lt;/strong&gt; to the device via USB or TCP. Verify adb can see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Back up the current WhatsApp APK&lt;/strong&gt; from /data/app/com.whatsapp/ to /data/local/tmp/WhatsAppbackup.apk on the device. This is the user's currently installed version, saved so it can be reinstalled later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uninstall current WhatsApp&lt;/strong&gt; with adb shell pm uninstall -k com.whatsapp. The -k flag keeps the data directory intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install Legacy WhatsApp v2.11.431&lt;/strong&gt; via adb install. It finds the existing data directory and inherits it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt the user&lt;/strong&gt; to open Legacy WhatsApp briefly on the phone, then return to the terminal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt;  &lt;strong&gt;adb backup com.whatsapp&lt;/strong&gt;. The system dialog appears on the phone. The user taps "Back up my data."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receive&lt;/strong&gt;  &lt;strong&gt;whatsapp.ab&lt;/strong&gt; on the computer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpack with android-backup-extractor&lt;/strong&gt;: java -jar abe.jar unpack whatsapp.ab whatsapp.tar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract the tar&lt;/strong&gt;. Locate key and msgstore.db (and other databases).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy files&lt;/strong&gt; to extracted//.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optionally compress&lt;/strong&gt; the folder as a password-protected 7z archive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uninstall Legacy WhatsApp&lt;/strong&gt;. Reinstall the backed-up current WhatsApp APK.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy&lt;/strong&gt;  &lt;strong&gt;msgstore.db&lt;/strong&gt; back to the phone's SD card for convenience.&lt;/li&gt;
&lt;/ol&gt;
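&lt;p&gt;Step 8 relies on android-backup-extractor. For an unencrypted backup, the same unpacking can be sketched in plain Python, since an .ab file is a four-line text header followed by an optionally zlib-compressed tar stream. This is a swapped-in illustration, not what the script runs: ab_to_tar is a hypothetical name, encrypted backups are rejected, and the whole file is read into memory, so it only suits small backups.&lt;/p&gt;

```python
import zlib


def ab_to_tar(ab_path: str, tar_path: str) -> None:
    """Convert an unencrypted Android backup (.ab) file into a plain tar file."""
    with open(ab_path, "rb") as src:
        magic = src.readline().strip()
        src.readline()                       # format version, not needed here
        compressed = src.readline().strip()  # b"1" if the payload is deflated
        encryption = src.readline().strip()  # b"none" for unencrypted backups
        if magic != b"ANDROID BACKUP":
            raise ValueError("not an Android backup file")
        if encryption != b"none":
            raise ValueError("encrypted backups are not handled by this sketch")
        payload = src.read()
    if compressed == b"1":
        payload = zlib.decompress(payload)
    with open(tar_path, "wb") as dst:
        dst.write(payload)
```

&lt;p&gt;The resulting tar can then be opened with Python's tarfile module or any archive tool, which corresponds to step 9.&lt;/p&gt;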

&lt;p&gt;The reinstallation in step 12 uses the APK that was saved in step 2 (the user’s exact current version) rather than downloading from the Play Store. This matters because the Play Store version might have updated since the backup was taken, and installing a different version could cause issues with the database format.&lt;/p&gt;

&lt;p&gt;The whole process takes a few minutes. The window where the user’s WhatsApp is replaced by the legacy version is the longest part — the backup can take a while depending on database size. If something goes wrong during reinstallation and WhatsApp refuses to open, the recovery is a clean reinstall from the Play Store followed by restoring from the local or Google Drive backup that the README tells you to take before starting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfkw5tokoc2v05l6oqv1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftfkw5tokoc2v05l6oqv1.gif" width="560" height="314"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;GIF: Screen recording of complete run of the tool&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That warning at the top of the README (“Hope for the best, prepare for the worst”) is there for a reason.&lt;/p&gt;

&lt;p&gt;Project URL: github.com/YuvrajRaghuvanshiS/WhatsApp-Key-Database-Extractor&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is for educational purposes — for understanding how Android’s app sandbox and backup system work and how a single overlooked flag in a manifest file can undermine both. Use it on your own data.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;May 03, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>reverseengineering</category>
      <category>android</category>
      <category>whatsapp</category>
    </item>
    <item>
      <title>The Website That Looked Like It Needed Selenium (But Didn’t)</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:33:32 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/the-website-that-looked-like-it-needed-selenium-but-didnt-1p1</link>
      <guid>https://dev.to/yuvrajraghuvanshis/the-website-that-looked-like-it-needed-selenium-but-didnt-1p1</guid>
      <description>&lt;p&gt;For my thesis I needed a large corpus of Hindi poetry. &lt;a href="https://hindwi.org" rel="noopener noreferrer"&gt;Hindwi&lt;/a&gt; is one of the better maintained Hindi literature archives on the internet. Thousands of poems, hundreds of poets, content spanning from the 8th century to contemporary writers. It had everything I needed.&lt;/p&gt;

&lt;p&gt;I didn’t plan to spend much time on the scraper. Collect the data, move on.&lt;/p&gt;

&lt;p&gt;That didn’t happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Obvious Problem
&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://hindwi.org/poets" rel="noopener noreferrer"&gt;hindwi.org/poets&lt;/a&gt; and you’ll see a listing of poets. Scroll down and more appear. Visit an individual poet’s page and the same thing happens — poems load as you scroll. This is the pattern that makes every scraper writer reach for Selenium almost reflexively. The content isn’t in the initial HTML. JavaScript is loading it dynamically. You need a browser.&lt;/p&gt;

&lt;p&gt;So I set up Selenium. Headless Chrome, scroll simulation, wait for elements to appear, extract content. It worked. It was also agonizingly slow.&lt;/p&gt;

&lt;p&gt;The real problem wasn’t just speed — it was that Selenium is fundamentally impractical to parallelize. You can’t easily spin up ten browser instances and scrape ten poets simultaneously the way you can with threads making HTTP requests. Each browser instance carries its own rendering engine, memory space, and JavaScript runtime. The resource cost compounds quickly, and the coordination between instances is a nightmare. Even with aggressive parallelism, back-of-envelope math on 25,000+ poems made it clear this would take days, not hours.&lt;/p&gt;

&lt;p&gt;There had to be a better way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ten Minutes in DevTools
&lt;/h3&gt;

&lt;p&gt;Before writing any more Selenium code, I opened the browser DevTools Network tab and watched what actually happened when the page loaded more content.&lt;/p&gt;

&lt;p&gt;This is always worth doing before committing to browser automation. Dynamic-looking behavior on the frontend is still, at the network level, just HTTP requests. The browser has to get the data from somewhere. The question is whether that somewhere is directly reachable.&lt;/p&gt;

&lt;p&gt;On Hindwi, when you scroll to the bottom of the poets listing, the browser fires a request like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.hindwi.org/PoetCollection?lang=2&amp;amp;pageNumber=2&amp;amp;Info=poet
&amp;amp;StartsWith=&amp;amp;keyword=&amp;amp;typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
&amp;amp;TypeSlug=poets&amp;amp;contentFilter=&amp;amp;_=1777462454692
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plain GET request. No authentication tokens, no encrypted signatures, no WebSocket handshake. Just query parameters. The _=1777462454692 at the end is a cache-busting timestamp appended automatically on the client side; the server doesn't validate it, so a scraper can leave it out entirely.&lt;/p&gt;
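&lt;p&gt;If URLs are copied straight out of DevTools, the timestamp parameter can be dropped before reuse. A minimal sketch using only the standard library; strip_cache_buster is a hypothetical helper, not part of the scraper described here.&lt;/p&gt;

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_cache_buster(url: str) -> str:
    """Return the URL without its `_` query parameter, keeping the rest in order."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "_"
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))
```

&lt;p&gt;keep_blank_values=True matters here because several of the site's parameters (StartsWith, keyword, contentFilter) are sent with empty values and would otherwise be dropped.&lt;/p&gt;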

&lt;p&gt;The response that came back was raw HTML — not JSON, not XML. Just HTML cards containing poet names, dates, and profile links, ready to be injected into the DOM. So the website wasn’t serving a proper API, but it was serving something structured, paginated, and directly reachable over plain HTTP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vg5lsuq7f8x7a9jnxv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vg5lsuq7f8x7a9jnxv5.png" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: DevTools Network tab showing the /PoetCollection request and its HTML response body&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next question was: how does the browser know what URL to request for page 3, page 4, page 5? Where does that information come from?&lt;/p&gt;
&lt;h3&gt;
  
  
  The URL Was Sitting Right There
&lt;/h3&gt;

&lt;p&gt;I looked at the page source. And there they were — all of them, already embedded in the initial HTML response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMore"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=3&amp;amp;Info=poet
                  &amp;amp;StartsWith=&amp;amp;keyword=&amp;amp;typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
                  &amp;amp;TypeSlug=poets&amp;amp;contentFilter="&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;svg&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"screenLoader"&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/svg&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The site pre-embeds the URLs for every subsequent page inside data-url attributes on div.contentLoadMorePaging elements. The JavaScript reads these attributes and fires the requests when you scroll into view. But from a scraper's perspective, the URLs are already there in the first response - you don't need to scroll anything. You just parse them out and fetch them directly.&lt;/p&gt;

&lt;p&gt;This was the moment Selenium became irrelevant.&lt;/p&gt;

&lt;p&gt;What looked like dynamic JavaScript-driven content was really just a simple pattern: fetch the initial page, extract the hidden data-url values, make those HTTP requests directly. No browser. No scroll simulation. No waiting for DOM mutations.&lt;/p&gt;
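&lt;p&gt;To show how little machinery the extraction step needs, here is a sketch that pulls the data-url values out of a page using only Python's standard library. It is an illustration, not the scraper's actual code: DataUrlCollector and extract_next_urls are hypothetical names, and html.parser is used in place of BeautifulSoup purely to keep the example dependency-free.&lt;/p&gt;

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataUrlCollector(HTMLParser):
    """Collect data-url values from div.contentLoadMorePaging elements."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        data_url = attr.get("data-url")
        if tag == "div" and "contentLoadMorePaging" in classes and data_url:
            # Resolve the relative URL and skip duplicates, preserving order.
            full_url = urljoin(self.base_url, data_url.strip())
            if full_url not in self.urls:
                self.urls.append(full_url)


def extract_next_urls(html: str, base_url: str = "https://www.hindwi.org") -> list[str]:
    """Return the absolute pagination URLs embedded in a page's HTML."""
    parser = DataUrlCollector(base_url)
    parser.feed(html)
    return parser.urls
```

&lt;p&gt;Feeding the initial page's HTML to extract_next_urls returns the list of follow-up URLs, each of which can be fetched with an ordinary HTTP GET.&lt;/p&gt;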

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ve7ifmf5sak1r2ig2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ve7ifmf5sak1r2ig2g.png" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: page source with the contentLoadMorePaging div and data-url attribute visible&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Same Pattern, Everywhere
&lt;/h3&gt;

&lt;p&gt;Once I knew what to look for, I checked the individual poet pages. Same pattern. A poet with more than 50 poems (Mona Gulati, for example) has this in her initial page response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMore"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=2&amp;amp;info=ghazals
                  &amp;amp;SEO_Slug=kavita&amp;amp;Id=34074990-5be7-43e9-8a85-6aaa0be4833c
                  &amp;amp;Info=ghazal&amp;amp;StartsWith=a&amp;amp;typeID=659186cb-...
                  &amp;amp;contentType=kavita&amp;amp;sort=popularity-desc&amp;amp;filter="&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=3..."&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both page 2 and page 3 are listed upfront in the first response. The site hands you the complete roadmap immediately. Fetch once, and you know exactly what to fetch next — no interaction, no scrolling, no waiting.&lt;/p&gt;

&lt;p&gt;This held for dohas, quotes, and every other content type on the site. The contentLoadMorePaging pattern was consistent across all of Hindwi. Understanding it once meant the whole site was open.&lt;/p&gt;

&lt;h3&gt;
  
  
  Turning the Insight Into Code
&lt;/h3&gt;

&lt;p&gt;The scraper that came out of this is conceptually simple. For the poet listing, hit the /PoetCollection endpoint and keep incrementing pageNumber until you get an empty response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_paginated_poet_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pageNumber&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POETS_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.poetColumn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cards&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For poem lists, fetch the poet’s kavita page, parse whatever poems are already in the initial HTML, then extract and follow every data-url:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_poem_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kavita_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kavita_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;poems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_poem_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pagination_divs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.contentLoadMorePaging[data-url]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;seen_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pagination_divs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;seen_urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;full_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.hindwi.org&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;paginated_soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;poems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_poem_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paginated_soup&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;poems&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No browser. No scroll events. One request for the poet’s initial page, plus one for each data-url it lists.&lt;/p&gt;
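&lt;p&gt;With every page reduced to a plain HTTP request, the thread-based parallelism that was impractical with Selenium becomes straightforward. A minimal sketch, assuming a fetch callable that takes a URL and returns parsed content (something like get_soup); fetch_all and the default of 8 workers are illustrative choices, not taken from the scraper.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

&lt;p&gt;Threads suit this workload because the time is spent waiting on the network. The worker count is worth keeping modest so the scraper doesn't hammer the site.&lt;/p&gt;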

&lt;p&gt;One thing worth mentioning about _parse_poem_list: the initial page and the dynamically loaded fragment pages use different CSS classes for their poem cards. The initial listing uses div.rt_contentBodyListItems, while the paginated HTML fragments come back using div.contentListItems.nwPoetListBody. I caught this when certain poets were returning suspiciously fewer poems than their profile pages suggested - the paginated content was being silently skipped because the selector only matched the first class. A multi-selector handles both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly the kind of thing that produces wrong results silently. No error, no exception — just a poem count that’s quietly lower than it should be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbs8krjxj9dlnkfgk23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbs8krjxj9dlnkfgk23.png" width="705" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: terminal output showing a poet being processed with their correct poem count&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Extracting the Poems
&lt;/h3&gt;

&lt;p&gt;Each poem lives on its own URL. The page serves the text in Devanagari and, for many poems, a Romanized transliteration toggled by a button. In the HTML, both versions are already present — just hidden or shown depending on which toggle is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Devanagari
&lt;/span&gt;&lt;span class="n"&gt;hindi_div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pMC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Romanized
&lt;/span&gt;&lt;span class="n"&gt;roman_div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HindwiRoman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;roman_pmc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;roman_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pMC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text itself is structured as &amp;lt;p&amp;gt; tags containing &amp;lt;span&amp;gt; tags per word or phrase. Joining the spans within each paragraph gives one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hindi_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;hindi_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Both versions get saved as separate plain text files. Not every poem has a Romanized version, so the code returns None for the roman field when it doesn't exist rather than an empty list - preserving the distinction between "no Roman version" and "Roman version is blank."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8gznijxuhvda31pam53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8gznijxuhvda31pam53.png" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: a poem page on Hindwi showing the Devanagari text alongside the Roman toggle&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Concurrency — The Real Payoff
&lt;/h3&gt;

&lt;p&gt;With Selenium out of the picture, threading became trivial. The poem scraper processes all poets concurrently with a thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_poems&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_process_poet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poets&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten threads making lightweight HTTP requests is nothing. This is what was completely impractical with Selenium: ten browser instances would have needed a dedicated server to run without thrashing. Ten threads using requests ran fine on a laptop, barely registering on the CPU.&lt;/p&gt;

&lt;p&gt;Every request goes through a shared get_soup wrapper that enforces a 1-second politeness delay and retries with exponential backoff on failures. Errors at any level - a single poem, an entire poet - get logged and skipped rather than crashing the thread. The run completed cleanly over about two hours. A small number of URLs consistently returned server errors and landed in the log; everything else went through without issue.&lt;/p&gt;
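
&lt;p&gt;The get_soup wrapper itself isn’t shown here, so the following is only a minimal sketch of the delay-and-backoff idea. The names fetch_with_backoff, fetch, retries, delay and sleep are assumptions for illustration. fetch stands in for whatever actually performs the request (for example a small function around requests.get), and it is passed in so the retry logic can be tried without touching the network.&lt;/p&gt;

```python
import time


def fetch_with_backoff(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) politely, retrying with exponential backoff.

    fetch is any callable that returns a response or raises on failure.
    All parameter names and defaults here are illustrative assumptions.
    """
    for attempt in range(retries + 1):
        sleep(delay)  # politeness delay before every request
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries: the caller decides to log and skip
            sleep(delay * 2 ** attempt)  # extra wait: delay, 2*delay, 4*delay, ...
```

&lt;p&gt;Re-raising after the last attempt leaves the decision to the caller, which fits the log-and-skip behaviour described above.&lt;/p&gt;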

&lt;h3&gt;
  
  
  The Result
&lt;/h3&gt;

&lt;p&gt;Two hours. 25,000+ poems across hundreds of poets. Devanagari and Romanized versions where available. Structured metadata including titles, URLs, slugs, and categories per poem. Around 300MB of text in total.&lt;/p&gt;

&lt;p&gt;The dependency list tells the whole story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;beautifulsoup4==4.13.4
requests==2.32.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Selenium, no browser drivers, no Playwright, no headless Chrome. Just HTTP requests and HTML parsing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Took From This
&lt;/h3&gt;

&lt;p&gt;The instinct to reach for Selenium when you see dynamic content is understandable — it’s the safe default that definitely works. But dynamic content loading just means the browser is making HTTP requests after the initial page load. Those requests go somewhere, return something, and in most cases can be replicated directly.&lt;/p&gt;

&lt;p&gt;The contentLoadMorePaging pattern on Hindwi is a good illustration of how often websites like this are more accessible than they appear. The site wasn't hiding anything. It was handing out pagination URLs in plain HTML, sitting in data-url attributes, ready to be read. JavaScript happened to be the first thing reading them - until a scraper was.&lt;/p&gt;

&lt;p&gt;Ten minutes in the Network tab before writing any scraping code is almost always worth it. In this case, it was the difference between days of Selenium pain and a two-hour requests script that finished before lunch.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is for educational purposes — all ethical considerations have been addressed, including measures such as rate limiting and conducting scraping during periods of low website traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 30, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>reverseengineering</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Training a Classifier on Huge Dataset When RAM Is Not Your Friend</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:05:03 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/training-a-classifier-on-huge-dataset-when-ram-is-not-your-friend-kle</link>
      <guid>https://dev.to/yuvrajraghuvanshis/training-a-classifier-on-huge-dataset-when-ram-is-not-your-friend-kle</guid>
      <description>&lt;p&gt;I didn’t set out to build a custom data loader. I set out to train a model on the Quick, Draw! dataset.&lt;/p&gt;

&lt;p&gt;The data pipeline was supposed to be the boring part — the few lines you write before the interesting work starts. It ended up being most of the work, the source of the most frustrating bugs, and, in retrospect, the most interesting engineering decision of the whole project.&lt;/p&gt;

&lt;p&gt;This is the story of why I ended up with a directory containing millions of individual .npy files, and why that turned out to be the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Quick, Draw! Actually Is
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/googlecreativelab/quickdraw-dataset" rel="noopener noreferrer"&gt;Quick, Draw!&lt;/a&gt; is a Google dataset of human drawings collected from a browser game where players had 20 seconds to draw a prompted word. It has 345 categories (cats, airplanes, zigzags, The Eiffel Tower), with more than 100,000 drawings in a typical class. That’s about 50 million drawings in total.&lt;/p&gt;

&lt;p&gt;What makes it interesting for ML, and annoying for data pipelines, is that each drawing has two representations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raster images&lt;/strong&gt;: each drawing rendered as a 28×28 grayscale bitmap, stored as a flat array of 784 values. These come in .npy files where a single file for one class contains an array of shape (N, 784). For 100,000 samples, that's 100,000 rows of 784 unsigned 8-bit pixel values per file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stroke sequences&lt;/strong&gt;: the original drawing data, a sequence of (dx, dy, pen_state) triplets representing how the pen moved. These come in .npz files, split into train, validation, and test keys. The stroke data varies in length per drawing: a simple zigzag might have 10 points, while a detailed drawing of The Great Wall of China might have hundreds.&lt;/p&gt;

&lt;p&gt;The model I wanted to build was multimodal: it would take both representations as input simultaneously, letting a CNN process the image and an LSTM process the stroke sequence, then merge their outputs for classification. Which meant the pipeline had to serve both modalities in sync, for every sample, across 345 classes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheuo2b4r54xvhjx3d41s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheuo2b4r54xvhjx3d41s.png" alt="Screenshot: sample drawings from Quick, Draw!, showing the raster image and stroke visualization side by side" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: sample drawings from Quick, Draw!, showing the raster image and stroke visualization side by side&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Naive Approach and Why It Dies
&lt;/h3&gt;

&lt;p&gt;The obvious first attempt is the one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# shape: (~100000, 784)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That loads fine for one class. You run it for a few classes, you’re still fine. Then somewhere around class 20 or 30 your process gets killed by the OOM killer, or your Jupyter kernel crashes silently, or the remote server you’ve SSH’d into drops your connection and takes your training run with it.&lt;/p&gt;

&lt;p&gt;With 345 classes at 30,000 samples each (my chosen limit), that means loading roughly 10 million samples into RAM at startup. Loading just 10,000 samples per class already took around 11% of a 128GB server’s memory, so the math at 30,000 gets uncomfortable fast. And that’s before you account for the stroke data.&lt;/p&gt;
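
&lt;p&gt;As a rough back-of-the-envelope check, the sketch below estimates the memory needed for the raster images alone. It assumes one byte per pixel for the raw bitmaps and four bytes per pixel once they are cast to float32. The function name raster_gigabytes is made up for this example, and Python object overhead and stroke data are ignored.&lt;/p&gt;

```python
NUM_CLASSES = 345
IMAGE_VALUES = 28 * 28  # 784 values per raster image


def raster_gigabytes(samples_per_class, bytes_per_value):
    """Approximate RAM for the raster images alone, in GB (10**9 bytes)."""
    total_bytes = NUM_CLASSES * samples_per_class * IMAGE_VALUES * bytes_per_value
    return total_bytes / 1e9


for per_class in (10_000, 30_000):
    raw = raster_gigabytes(per_class, 1)      # assumed: 8-bit pixels on disk
    as_float = raster_gigabytes(per_class, 4)  # assumed: float32 after scaling
    print(f"{per_class} per class: {raw:.1f} GB raw, {as_float:.1f} GB as float32")
```

&lt;p&gt;Under these assumptions, 10,000 samples per class works out to about 2.7 GB raw and 10.8 GB as float32, and 30,000 per class to about 8.1 GB raw and 32.5 GB as float32.&lt;/p&gt;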

&lt;p&gt;The real problem isn’t just peak RAM usage. It’s that loading everything upfront means you can’t start training until loading finishes, the loaded arrays stay resident for the entire run, and any shuffle operation has to work over the full dataset in memory. All of this compounds.&lt;/p&gt;

&lt;p&gt;There’s also a subtlety with the stroke files: they come pre-split into train/val/test partitions. If you want to do your own splits (which you do, so you can control the ratio and the random seed), you need to recombine them first and re-split yourself.&lt;/p&gt;
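
&lt;p&gt;Controlling the split yourself can be as simple as shuffling sample indices with a fixed seed and slicing the result. This is only a sketch: the function name split_indices, the 10% validation and test ratios, and the seed value are all assumptions, not the values used in this project.&lt;/p&gt;

```python
import random


def split_indices(n_samples, val_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle sample indices with a fixed seed and cut them into three lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed, same order every run
    n_test = int(n_samples * test_ratio)
    n_val = int(n_samples * val_ratio)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test
```

&lt;p&gt;Because the split works on indices rather than on the arrays themselves, the same three lists can later be used to pick matching image and stroke samples.&lt;/p&gt;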

&lt;p&gt;So before we get to the loader itself, there are three preprocessing steps to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Downloading the Data
&lt;/h3&gt;

&lt;p&gt;The download script fetches both file types from Google’s Cloud Storage. The listing endpoint returns XML, which the script parses to find the URLs for the classes you’ve defined in base_classes. Downloads run in parallel using a thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download_folder&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;file_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two separate calls — one for .npy raster files, one for .npz stroke files, filtered to the sketchrnn/ prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;download_quickdraw_files&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;file_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;download_quickdraw_files&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;file_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sketchrnn/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Files that already exist are skipped, which matters when you’re running this on a remote server where connections drop and you have to restart.&lt;/p&gt;
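
&lt;p&gt;The body of download_file isn’t shown, so here is one way the skip-if-present check could look. Treat it as a sketch under assumptions: the fetch parameter and the temporary .part suffix are additions for illustration (the latter so that an interrupted download is not mistaken for a finished file), and only the [DOWNLOAD] and [SKIP] labels come from the actual script output.&lt;/p&gt;

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def download_file(url, download_folder, fetch=urlretrieve):
    """Download url into download_folder unless the file is already there."""
    filename = os.path.basename(urlparse(url).path)
    target = os.path.join(download_folder, filename)
    if os.path.exists(target):
        print(f"[SKIP] {filename}")
        return target
    # Write under a temporary name first, then rename once the transfer
    # has finished, so a dropped connection never leaves a partial target.
    partial = target + ".part"
    print(f"[DOWNLOAD] {filename}")
    fetch(url, partial)
    os.replace(partial, target)
    return target
```

&lt;p&gt;Restarting the script after a dropped connection then re-fetches only the files that never completed.&lt;/p&gt;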

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gyiac4oc4mno5t0gbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gyiac4oc4mno5t0gbg.png" alt="Screenshot: terminal output during download — the [DOWNLOAD] and [SKIP] lines showing parallel fetching" width="773" height="260"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: terminal output during download — the [DOWNLOAD] and [SKIP] lines showing parallel fetching&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Recombining the Stroke Splits
&lt;/h3&gt;

&lt;p&gt;Each stroke .npz file has three keys: train, valid, and test. Left as-is, you're working with a subset of the available data. The fix is to concatenate them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savez_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs in parallel across all classes using ProcessPoolExecutor. One thing worth noting: after combining, gc.collect() is called explicitly. Each worker process handles several classes over its lifetime, and the large arrays from a finished class are not always released as promptly as you'd expect. Without the explicit collection, a machine with moderate RAM will start sweating as dozens of processes hold combined arrays simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Key Idea — One File Per Sample
&lt;/h3&gt;

&lt;p&gt;This is the decision everything else depends on.&lt;/p&gt;

&lt;p&gt;Instead of keeping each class as a single large .npy file, we explode every sample out into its own file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset/processed/
  images/
    cat/
      000001.npy ← shape: (28, 28, 1)
      000002.npy
      ...
  strokes/
    cat/
      000001.npy ← shape: (130, 3)
      000002.npy
      ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conversion script loops over every class, loads the class-level arrays, preprocesses each sample, and saves them individually. The index is global across all classes — not per-class — which is what keeps image and stroke files aligned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;max_samples_per_class&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100_000&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;LABEL_MAP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mmap_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# note: memory-mapped
&lt;/span&gt;    &lt;span class="n"&gt;strokes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;stroke_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_pickle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latin1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strokes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_samples_per_class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nf"&gt;preprocess_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strokes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nf"&gt;preprocess_strokes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image loading uses mmap_mode="r" - memory-mapped, so NumPy doesn't load the entire (100000, 784) array into RAM just to iterate over it row by row. The preprocessing happens at this stage, not at training time, so the generator later is just doing file reads.&lt;/p&gt;
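
&lt;p&gt;To see the memory-mapping behaviour in isolation, the sketch below builds a tiny stand-in for a class-level file and then writes each row out on its own. The five-row array, the temporary directory and the file names are placeholders for illustration, not the real dataset layout, and no preprocessing is applied.&lt;/p&gt;

```python
import os
import tempfile

import numpy as np

# Build a tiny stand-in for a class-level file such as cat.npy.
workdir = tempfile.mkdtemp()
class_file = os.path.join(workdir, "cat.npy")
fake = np.zeros((5, 784), dtype=np.uint8)
fake[:, 0] = np.arange(5)  # tag each row with its index
np.save(class_file, fake)

# mmap_mode="r" returns a read-only np.memmap backed by the file on disk
# instead of reading the whole array into memory.
images = np.load(class_file, mmap_mode="r")

for i in range(len(images)):
    row = np.array(images[i])  # copies just this row (784 bytes) into RAM
    np.save(os.path.join(workdir, f"{i:06d}.npy"), row)
```

&lt;p&gt;Only the rows actually indexed are pulled from disk, which is why the conversion can walk a large class file without holding all of it in memory at once.&lt;/p&gt;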

&lt;p&gt;This step takes a while to run. On the upside, it runs once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoeh8buwpzssev3fuhkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoeh8buwpzssev3fuhkk.png" alt="Screenshot: the processed/ directory structure — showing the per-class subdirectories with numbered .npy files" width="379" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: the processed/ directory structure — showing the per-class subdirectories with numbered .npy files&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What Preprocessing Actually Does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Images&lt;/strong&gt; are straightforward. Reshape (784,) to (28, 28), divide by 255 to get [0, 1] floats, expand the channel dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flat_img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (28, 28, 1)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strokes&lt;/strong&gt; are more involved. The raw data uses relative coordinates — each (dx, dy) is an offset from the previous point, not an absolute position. This makes sense for how drawings are recorded but not for how a model should see them. The preprocessing converts to absolute, centers the drawing at the origin, then scales to a fixed [-100, 100] range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Relative -&amp;gt; absolute
&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Center at origin
&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Scale to [-100, 100]
&lt;/span&gt;&lt;span class="n"&gt;max_coord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt;
    &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stroke sequences are variable length. To get a fixed-size tensor for the LSTM, sequences are either truncated or zero-padded to 130 points. Why 130? Empirically, that covers the vast majority of drawings in the dataset without wasting too many zeros on the short ones.&lt;/p&gt;

&lt;p&gt;The pen state column (the third feature) is left as-is — it’s already a binary indicator of whether the pen is lifted.&lt;/p&gt;
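
&lt;p&gt;The truncate-or-pad step is described but not shown, so here is a minimal, dependency-free sketch of it. The helper name pad_or_truncate and the use of plain lists of (x, y, pen) tuples are assumptions made for illustration; the project itself presumably works on NumPy arrays.&lt;/p&gt;

```python
MAX_POINTS = 130  # fixed sequence length quoted in the text


def pad_or_truncate(points, max_points=MAX_POINTS):
    """Return exactly max_points rows of (x, y, pen).

    Longer sequences are cut off at the end; shorter ones are filled
    with all-zero rows. Hypothetical helper, not the project's code.
    """
    rows = [tuple(p) for p in points[:max_points]]           # truncate
    padding = [(0.0, 0.0, 0.0)] * (max_points - len(rows))   # zero rows
    return rows + padding


short = pad_or_truncate([(1.0, 2.0, 0.0), (3.0, 4.0, 1.0)])
long = pad_or_truncate([(float(i), 0.0, 0.0) for i in range(200)])
print(len(short), len(long))  # both sequences now have 130 rows
```

&lt;p&gt;Padding at the end keeps the real points at the start of the sequence, so the LSTM reads the drawing first and the zeros last.&lt;/p&gt;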

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzppujve678gc3qvsd81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzppujve678gc3qvsd81.png" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: before/after visualization of a stroke — raw relative coordinates as a mess of lines, then the centered/normalized version looking like the actual drawing&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Loader
&lt;/h3&gt;

&lt;p&gt;After preprocessing, the index step is fast. We walk the processed directory and collect all file paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;LABEL_MAP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;image_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROCESSED_DATA_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;stroke_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROCESSED_DATA_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/strokes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stroke_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;SAMPLES_PER_CLASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_files&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stroke_files&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, images and strokes are just lists of strings. Nothing has been loaded into memory. The total dataset - 345 classes × 30,000 samples - indexes in a few seconds.&lt;/p&gt;

&lt;p&gt;There’s also a threshold in the config: IN_MEMORY_THRESHOLD = 30_000. If SAMPLES_PER_CLASS is at or below that number, the loader calls np.load() during indexing and stores the arrays directly. For quick experiments on a subset of the data, this avoids the per-sample I/O overhead at training time. Larger runs stream from disk instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;USE_INDIVIDUAL&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SAMPLES_PER_CLASS&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;IN_MEMORY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both paths feed into the same generator interface, which is a nice property — you can switch between them by changing one number.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Generator and the tf.data Pipeline
&lt;/h3&gt;

&lt;p&gt;The generator is a Python function that yields (image, stroke, one_hot_label) tuples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;USE_INDIVIDUAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feeds into a tf.data.Dataset via from_generator, which requires explicit output signatures - TensorFlow needs to know shapes and dtypes upfront since it can't infer them from a Python generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;130&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full pipeline adds shuffling (over a buffer of 10× the batch size rather than the entire dataset), repeating, a map step that reformats each sample, batching at 512, and prefetching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
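    &lt;span class="c1"&gt;# Bind the lists so from_generator receives a zero-argument callable
&lt;/span&gt;    &lt;span class="n"&gt;gen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;data_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;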
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_shuffle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format_sample step reformats the yielded tuple into the dictionary format Keras expects for multi-input models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stroke_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shuffling indices, not files, is important here. The file layout on disk stays sequential — images for cat are in one directory, images for airplane in another. The shuffle happens in the data pipeline as it reads, which avoids random I/O seeks across the disk. Sequential reads are substantially faster than random ones, and the OS page cache will warm up the recently accessed files naturally.&lt;/p&gt;
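
&lt;p&gt;To make the buffered shuffle concrete, here is a pure-Python sketch of the idea behind a bounded shuffle buffer: fill a buffer, emit a random element from it, and put the next incoming element in the freed slot. It is an illustration of the technique, not TensorFlow's implementation, and the function name buffered_shuffle is made up for this example.&lt;/p&gt;

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) == buffer_size:
            # Buffer is full: emit a random element and reuse its slot.
            slot = rng.randrange(buffer_size)
            yield buffer[slot]
            buffer[slot] = item
        else:
            buffer.append(item)
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(20), buffer_size=5))
print(order)
```

&lt;p&gt;An element can only jump ahead by fewer than buffer_size positions. With a buffer of 10 × 512 = 5,120 samples and up to 30,000 samples per class, a buffered shuffle on its own would not mix classes well if the input were sorted by class. That is why the one-time shuffle of the global index does most of the work here.&lt;/p&gt;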

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzbhigzyiwseo9odc0vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzbhigzyiwseo9odc0vn.png" width="800" height="292"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: htop showing RAM usage during training — relatively flat, not growing with training time&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Splitting the Dataset
&lt;/h3&gt;

&lt;p&gt;The split is index-based. We shuffle a global index array once with a fixed seed, then slice it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;train_end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;val_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;val_end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_end&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 80/10/10 ratio applies across all classes, because the shuffle runs over one global index that spans every class. The split is not strictly stratified, but with 30,000 samples per class the chance of a class landing entirely in the training set and missing from validation is negligible.&lt;/p&gt;
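
&lt;p&gt;One way to check class coverage for a split like this is to count the labels that end up in each part. The sketch below is a stand-in, not the project's code: it uses Python's random module instead of NumPy, integer arithmetic for the 80/10/10 boundaries, and 300 samples per class instead of 30,000 so it runs instantly.&lt;/p&gt;

```python
import random
from collections import Counter

NUM_CLASSES = 345
SAMPLES_PER_CLASS = 300  # small stand-in for 30,000

# Labels in index order: all of class 0, then all of class 1, and so on.
labels = [c for c in range(NUM_CLASSES) for _ in range(SAMPLES_PER_CLASS)]
total = len(labels)

indices = list(range(total))
random.Random(42).shuffle(indices)  # one global shuffle with a fixed seed

train_end = total * 8 // 10          # first 80%
val_end = train_end + total // 10    # next 10%
splits = {
    "train": indices[:train_end],
    "val": indices[train_end:val_end],
    "test": indices[val_end:],
}

for name, idx in splits.items():
    counts = Counter(labels[i] for i in idx)
    print(name, len(idx), "classes present:", len(counts),
          "smallest class:", min(counts.values()))
```

&lt;p&gt;If any split reports fewer than 345 classes, a stratified split (shuffling and slicing within each class) would be the safer choice.&lt;/p&gt;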

&lt;p&gt;Validation and test datasets use .take() to consume a fixed number of batches - computed from the split sizes - since the dataset repeats indefinitely after .repeat():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;val_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;
          &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;test_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;
          &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
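
&lt;p&gt;A worked example of the batch count, assuming the full dataset (345 classes × 30,000 samples), a 10% validation split, and the batch size of 512. Because repeat() is applied before batch(), every batch is full, so rounding up makes the last batch wrap around to the start of the validation data.&lt;/p&gt;

```python
import math

NUM_CLASSES = 345
SAMPLES_PER_CLASS = 30_000
BATCH_SIZE = 512

total = NUM_CLASSES * SAMPLES_PER_CLASS          # 10,350,000 samples
val_size = total // 10                           # 10% split: 1,035,000
val_batches = math.ceil(val_size / BATCH_SIZE)   # value passed to take()

# Every batch holds BATCH_SIZE samples, so the final batch borrows
# samples from the beginning of the next pass over the data.
samples_seen = val_batches * BATCH_SIZE
wrapped = samples_seen - val_size

print(val_batches, samples_seen, wrapped)  # prints: 2022 1035264 264
```

&lt;p&gt;Under these assumptions, 264 validation samples are counted twice per evaluation pass. That is small next to 1,035,000, but worth knowing if exact metrics matter.&lt;/p&gt;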



&lt;h3&gt;
  
  
  What Went Wrong Along the Way
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File count.&lt;/strong&gt; The processed dataset ends up with roughly 345 × 30,000 × 2 = 20.7 million files. Some filesystems handle this poorly. If you're on a filesystem with inode limits or slow directory listing (common with some HPC storage systems), the sorted(glob(...)) calls at index time can take several minutes. Structured subdirectories (one per class) help, but it's still a lot of files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Index alignment.&lt;/strong&gt; The global index scheme — where file names reflect position across all classes, not within a class — exists entirely to prevent a specific bug. An earlier version used per-class indices, which caused a silent alignment failure: image cat/000001.npy and stroke cat/000001.npy were always aligned, but after shuffling, the code was pulling from globally-indexed lists and the class-local numbering didn't correspond. The {idx:06d} naming ensures that whatever index you retrieve from the lists, the image and stroke file names will match.&lt;/p&gt;
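
&lt;p&gt;A cheap guard against this kind of silent misalignment is to compare file names pairwise right after indexing. The sketch below is an assumption about how such a check could look; check_alignment and the example paths are hypothetical.&lt;/p&gt;

```python
import os


def check_alignment(image_paths, stroke_paths):
    """Raise ValueError if any image/stroke pair has different file names."""
    if len(image_paths) != len(stroke_paths):
        raise ValueError("image and stroke lists differ in length")
    for i, (img, stk) in enumerate(zip(image_paths, stroke_paths)):
        if os.path.basename(img) != os.path.basename(stk):
            raise ValueError(f"pair {i} is misaligned: {img} vs {stk}")


# Hypothetical paths following the {idx:06d}.npy naming scheme.
images = [f"processed/images/cat/{i:06d}.npy" for i in range(3)]
strokes = [f"processed/strokes/cat/{i:06d}.npy" for i in range(3)]
check_alignment(images, strokes)  # aligned lists pass silently

try:
    check_alignment(images, strokes[::-1])  # reversed list breaks pairing
except ValueError as err:
    print(err)
```

&lt;p&gt;Running a check like this once per indexing pass costs almost nothing compared to a training run on mismatched pairs.&lt;/p&gt;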

&lt;p&gt;&lt;strong&gt;Training on a remote server with an unstable SSH connection.&lt;/strong&gt; The training history in the notebook has a gap. BackupAndRestore meant the model weights survived; the history object didn't. TensorBoard logs were the fallback, and the actual metrics are there - the notebook's loss and accuracy plots just show what was available from the Python history object after reconnecting. If you're doing long training runs remotely, save the history separately and frequently, not just at the end.&lt;/p&gt;
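
&lt;p&gt;A minimal sketch of saving the history separately and frequently: append each epoch's metrics to a JSON-lines file as soon as the epoch ends. The helper names and the metric values are placeholders. In Keras, append_epoch could be called from a custom callback's on_epoch_end(epoch, logs), and the built-in CSVLogger callback with append=True is another option.&lt;/p&gt;

```python
import json
import os
import tempfile


def append_epoch(path, epoch, logs):
    """Append one epoch's metrics as a JSON line."""
    record = {"epoch": epoch}
    record.update({k: float(v) for k, v in logs.items()})
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_history(path):
    """Read back every recorded epoch as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "history.jsonl")
# Placeholder numbers for the demo, not real training results.
append_epoch(log_path, 0, {"loss": 2.31, "accuracy": 0.41})
append_epoch(log_path, 1, {"loss": 1.87, "accuracy": 0.55})
print(load_history(log_path))
```

&lt;p&gt;Because the file is opened in append mode, a restarted run keeps adding to the same log instead of overwriting it.&lt;/p&gt;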

&lt;p&gt;&lt;strong&gt;Memory growth with TensorFlow’s GPU allocator.&lt;/strong&gt; By default, TensorFlow pre-allocates nearly all of the memory on every GPU it can see. On a machine shared with other users, or one running other processes, this is a problem. The fix is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
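&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Must run before TensorFlow initializes the GPUs
&lt;/span&gt;&lt;span class="n"&gt;gpus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_physical_devices&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPU&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;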
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_memory_growth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes TensorFlow allocate GPU memory incrementally as needed. It’s not the default because pre-allocating most of the memory up front reduces fragmentation, so enabling growth can cost a little performance in some scenarios. For shared environments, though, it’s basically always the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I’d Do Differently
&lt;/h3&gt;

&lt;p&gt;The main thing I’d want to add is parallel file loading. Right now the generator is single-threaded — it loads one sample at a time, yields it, repeats. tf.data.AUTOTUNE on the prefetch helps by trying to keep the pipeline filled ahead of the model's consumption, but the actual I/O is sequential. Adding multiple generator workers (like PyTorch's num_workers) would reduce the time the GPU spends waiting for data.&lt;/p&gt;
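
&lt;p&gt;As a rough illustration of the idea, here is a standard-library sketch that loads files through a thread pool while still yielding results in order. It is not wired into tf.data, and read_bytes is only a stand-in for np.load. The function names, the worker count, and the demo files are all assumptions.&lt;/p&gt;

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_bytes(path):
    """Stand-in for np.load: read one file fully into memory."""
    with open(path, "rb") as f:
        return f.read()


def parallel_loader(paths, load_fn=read_bytes, workers=4):
    """Yield load_fn(path) for each path, in order, using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the loads concurrently but returns results in input order.
        yield from pool.map(load_fn, paths)


# Tiny demo on throwaway files.
tmp = tempfile.mkdtemp()
paths = []
for i in range(8):
    p = os.path.join(tmp, f"{i:06d}.bin")
    with open(p, "wb") as f:
        f.write(bytes([i]) * 16)
    paths.append(p)

loaded = list(parallel_loader(paths))
print([chunk[0] for chunk in loaded])  # first byte of each file, in order
```

&lt;p&gt;Note that pool.map submits every path up front, so a real loader over millions of files would need to feed it in chunks. Within tf.data itself, mapping a load function over a dataset of paths with num_parallel_calls is the more direct route, though I have not benchmarked either option here.&lt;/p&gt;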

&lt;p&gt;LMDB would also be worth experimenting with. The advantage over millions of small files is that it’s a single file that supports fast key-value lookup, sequential reading, and doesn’t suffer from filesystem overhead per-entry. The disadvantage is that it complicates the setup and makes debugging harder. For this project the small-files approach was fast enough, but at larger scale it would start to matter.&lt;/p&gt;

&lt;p&gt;A smarter caching strategy (keeping recently accessed samples in a bounded RAM buffer) would also help with the “warm up” problem. The first epoch is always slower than subsequent ones because the OS page cache starts cold. A pre-warmed in-memory buffer for the most frequently accessed samples would smooth that out.&lt;/p&gt;
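
&lt;p&gt;The simplest version of a bounded RAM buffer is a least-recently-used cache around the load function. The sketch below uses functools.lru_cache with a deliberately tiny size and a fake loader that counts disk hits; slow_load and cached_load are hypothetical names.&lt;/p&gt;

```python
from functools import lru_cache

reads = {"count": 0}


def slow_load(path):
    """Stand-in for np.load(path); counts how often the disk is hit."""
    reads["count"] += 1
    return f"contents of {path}"


@lru_cache(maxsize=2)  # keep at most 2 samples in RAM (tiny, for the demo)
def cached_load(path):
    return slow_load(path)


for p in ["a.npy", "b.npy", "a.npy", "c.npy", "b.npy"]:
    cached_load(p)

# One of the five requests was served from the cache.
print(reads["count"], cached_load.cache_info())
```

&lt;p&gt;One caveat: when every epoch visits the whole dataset in random order, an LRU cache that holds only a fraction of the samples gets a hit rate of roughly that same fraction. A cache like this pays off mainly when access is skewed or the working set fits in the buffer.&lt;/p&gt;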

&lt;h3&gt;
  
  
  The Part That Surprised Me
&lt;/h3&gt;

&lt;p&gt;When I first sketched this out, my expectation was that disk-based loading would be noticeably slower than loading everything to RAM — enough to be a real bottleneck. It wasn’t, for a reason that only became clear after thinking about it: individual .npy file loads are fast. A (28, 28, 1) array at float32 is 3,136 bytes. A (130, 3) stroke array is 1,560 bytes. These are tiny files. The actual read time per sample is in the low microseconds, and the OS cache handles repeat accesses to recently-read files transparently.&lt;/p&gt;
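
&lt;p&gt;The byte counts follow directly from the array shapes, and the same arithmetic shows why loading everything into RAM is unattractive. The total below assumes the full 345 × 30,000 samples and counts raw array data only; each .npy file also carries a small header that is not included.&lt;/p&gt;

```python
import math

FLOAT32_BYTES = 4

image_bytes = math.prod((28, 28, 1)) * FLOAT32_BYTES   # one image array
stroke_bytes = math.prod((130, 3)) * FLOAT32_BYTES     # one stroke array
per_sample = image_bytes + stroke_bytes

total_samples = 345 * 30_000                           # 10,350,000
total_gb = per_sample * total_samples / 1e9

print(image_bytes, stroke_bytes, per_sample)  # prints: 3136 1560 4696
print(f"{total_gb:.1f} GB of raw array data")  # prints: 48.6 GB of raw array data
```

&lt;p&gt;So each sample is under 5 KB on its own, while the whole dataset is far more than most workstations can hold in RAM.&lt;/p&gt;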

&lt;p&gt;What you trade away compared to pure in-memory loading is predictability. With everything in RAM, access time is constant. With disk loading, you’re occasionally hitting a file that isn’t cached, and that read takes longer. In practice, the prefetch buffer absorbs most of this variance. The GPU never actually sat idle waiting for data in my runs — the bottleneck was always computation, not I/O.&lt;/p&gt;

&lt;p&gt;The other thing that surprised me was how much the single-file-per-class approach had been hiding. When everything for cat is one big (100000, 784) array, you have no choice but to load the whole thing before you can access any of it. That's a loading cost you pay every time. With individual files, you pay per sample - and you only pay for the samples you actually use.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Notebook Setup (in case it’s useful)
&lt;/h3&gt;

&lt;p&gt;One thing worth mentioning for anyone running this on a remote server: the port forwarding setup for Jupyter. If you’re SSH-ing into a machine and want to run notebooks rather than pulling .py files and running them in screen sessions, you forward the Jupyter port to localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@server_ip

&lt;span class="c1"&gt;# On the server:&lt;/span&gt;
&lt;span class="k"&gt;jupyter&lt;/span&gt; notebook --no-browser --port=8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re going through two layers of SSH (e.g. a department gateway server that routes to a compute node), you just carry the forwarding through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@gateway
&lt;span class="c1"&gt;# On gateway:&lt;/span&gt;
&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@compute_node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for full control over Python and library versions, running the kernel inside a virtual environment is worth the setup time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3.12 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .tens
&lt;span class="nb"&gt;source&lt;/span&gt; .tens/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter ipykernel tensorflow numpy tqdm matplotlib
python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.tens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can select .tens as the kernel in Jupyter and know exactly what Python version and library versions are running - which matters if you're planning to later quantize the model and deploy it somewhere like a Raspberry Pi, where the environment constraints are much stricter.&lt;/p&gt;

&lt;p&gt;The pipeline ended up being more engineered than I originally wanted. But it runs, it doesn’t crash, and it’ll scale to more classes or more samples per class without changes. For a dataset this size on a memory-constrained machine, that’s the bar.&lt;/p&gt;

&lt;p&gt;The code is all in the &lt;a href="https://github.com/YuvrajRaghuvanshiS/doodle-vision" rel="noopener noreferrer"&gt;repository&lt;/a&gt; if you want to look at the actual implementation rather than the edited excerpts here. I’ll make the repository public once the paper is accepted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 14, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Reverse Engineering SmartLock by Parivahan: What I Found Inside a Python Proctoring App</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:14:33 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/reverse-engineering-smartlock-by-parivahan-what-i-found-inside-a-python-proctoring-app-3oda</link>
      <guid>https://dev.to/yuvrajraghuvanshis/reverse-engineering-smartlock-by-parivahan-what-i-found-inside-a-python-proctoring-app-3oda</guid>
      <description>&lt;p&gt;I didn’t plan to reverse engineer a proctoring application. I just wanted to understand why a page kept refreshing in an infinite loop.&lt;/p&gt;

&lt;p&gt;That one puzzling symptom ended up pulling me down a rabbit hole that took days to climb out of — involving PyInstaller internals, broken decompilers, browser automation quirks, and a race condition that convincingly pretended to be tamper detection. The journey was longer than I expected, and honestly more interesting. So I figured I might as well write it up.&lt;/p&gt;

&lt;p&gt;The application in question is &lt;strong&gt;SmartLock by Parivahan&lt;/strong&gt;, a proctoring system used for the government learner’s license exam in India. And yes, I am getting a driving license at the age of 25. We all start somewhere.&lt;/p&gt;

&lt;p&gt;It’s a Python desktop app that locks down your machine, watches your screen, monitors USB ports, controls your browser, and talks to a remote server — all at the same time. Understanding how it does all of this, and how the pieces fit together, is what this article is about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Opening the Box
&lt;/h3&gt;

&lt;p&gt;The first thing I did was look at the installation directory. This usually tells you a lot before you write a single line of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_internal/
browser/
config/
log/
Pictures/
Smartlock.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That _internal/ folder was the giveaway. It's a classic PyInstaller signature. The folder contains Python libraries and compiled bytecode - essentially, a self-contained Python runtime shipped alongside the executable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdfaenv29sva28s1ckkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdfaenv29sva28s1ckkk.png" alt="Screenshot: Installation directory structure showing _internal/ folder and Smartlock.exe" width="373" height="266"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Installation directory structure showing _internal/ folder and Smartlock.exe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So the application was built in Python and packaged using PyInstaller. That meant extraction was possible using &lt;a href="https://github.com/extremecoders-re/pyinstxtractor" rel="noopener noreferrer"&gt;pyinstxtractor&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python pyinstxtractor.py Smartlock.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After extraction, the structure looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Smartlock.exe_extracted/
├── PYZ-00.pyz_extracted/
│ ├── asyncio/
│ ├── psutil/
│ ├── pydivert/
│ ├── selenium/
│ ├── websockets/
│ ├── win32com/
│ ├── yaml/
│ ├── controller.pyc
│ ├── registry_edit.pyc
│ └── ...
├── core.pyc
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of what you see here is noise — third-party libraries. The signal is in the handful of .pyc files: core.pyc, controller.pyc, registry_edit.pyc. These contain the actual application logic. Everything else is plumbing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Decompilation and Why It’s Always Messier Than It Sounds
&lt;/h3&gt;

&lt;p&gt;This is where things got annoying.&lt;/p&gt;

&lt;p&gt;I tried the standard tools first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uncompyle6 core.pyc
decompyle3 core.pyc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both failed. Version mismatch — the bytecode was compiled with a Python version these tools didn’t fully support. I eventually got somewhere using &lt;a href="https://pychaos.io/" rel="noopener noreferrer"&gt;pychaos&lt;/a&gt;, but I want to be honest about what “decompiled code” actually looks like in practice. It’s not clean. Comments are gone (they’re never stored in bytecode). Control flow gets reconstructed heuristically and is often wrong. You get artifacts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;__CHAOS_PY_TEST_NOT_INIT_ERR__&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the actual code probably looked something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decompiler is doing its best, but it’s guessing. Reverse engineering at this level is less about reading code and more about reconstructing intent from imperfect evidence. You develop a feel for what the code is &lt;em&gt;trying&lt;/em&gt; to do, even when the syntax is broken.&lt;/p&gt;

&lt;p&gt;The three key files broke down roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;core.pyc: main orchestrator (startup, thread management)&lt;/li&gt;
&lt;li&gt;controller.pyc: enforcement logic (monitoring, detection)&lt;/li&gt;
&lt;li&gt;registry_edit.pyc: OS-level restrictions (registry modifications)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Reconstructing the Architecture
&lt;/h3&gt;

&lt;p&gt;Once I had a working (if imperfect) picture of the code, the overall architecture became clear:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds92c40ok5jb8swdii9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds92c40ok5jb8swdii9.png" width="800" height="978"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: detailed diagram showcasing the connections between SmartLock application, bundled Chrome application, and remote server&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s a tightly coupled system. The desktop app and the browser aren’t independent — they’re in constant communication. And both of them are talking to the remote server. Remove any one of these connections and the whole thing breaks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 4: The Browser That Kept Redirecting
&lt;/h3&gt;

&lt;p&gt;After reconstructing and running the application, I ran into something strange.&lt;/p&gt;

&lt;p&gt;The exam page kept redirecting to a 403.jsp page and, a split second later, back to the exam login page. Every few seconds: reload, reload, reload. My first instinct was that this was intentional tamper detection. After all, the whole point of a proctoring system is to detect when something isn’t right. Maybe it had detected something about my environment and was punishing me with an infinite loop.&lt;/p&gt;

&lt;p&gt;That turned out to be wrong. But figuring out &lt;em&gt;why&lt;/em&gt; it was wrong took a while.&lt;/p&gt;

&lt;p&gt;The browser bundled with SmartLock isn’t a standard Chrome installation. It’s a portable Chromium build with a preconfigured user profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;browser/
├── App/
│ └── Chrome-bin/
├── Data/
│ └── profile/
│ └── Default/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I inspected the stored cookies, session tokens, cached scripts, and extensions looking for some kind of tamper detection artifact. Nothing useful. The refreshing wasn’t coming from stored state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oxw3343h3j7ky19tqjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oxw3343h3j7ky19tqjn.png" width="378" height="389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: browser/ directory structure&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 5: SmartSocket.js — Where It All Connected
&lt;/h3&gt;

&lt;p&gt;The actual cause was in the exam webpage itself. Buried in the page source was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"SmartLock/SmartSocket.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script establishes a WebSocket connection to the local application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ws://localhost:8000/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the critical link. The browser doesn’t just display the exam — it actively depends on the local application being alive and reachable. As soon as the connection is established, the browser authenticates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authentication&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1234&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;appl_no&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if the connection fails — even momentarily — the page clears the session and reloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On connection failure:&lt;/span&gt;
&lt;span class="c1"&gt;// → Clear session&lt;/span&gt;
&lt;span class="c1"&gt;// → Redirect or reload&lt;/span&gt;
&lt;span class="c1"&gt;// → UI resets to initial state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not tamper detection. It’s a strict runtime dependency. The browser requires the local WebSocket server to be up &lt;em&gt;before&lt;/em&gt; it finishes loading. If it isn’t, you get an infinite reload loop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29jg4lht5kvh2be0sfge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29jg4lht5kvh2be0sfge.png" alt="Screenshot: SmartSocket.js connection code in browser devtools" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: SmartSocket.js connection code in browser devtools&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 6: The Race Condition
&lt;/h3&gt;

&lt;p&gt;With that understanding, the root cause became obvious. The application starts the WebSocket server and the browser in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thread1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;start_websocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;thread2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;launch_browser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with this is that “starting” the WebSocket server takes a moment. The browser, however, is fast — it loads the page, runs the script, and tries to connect to localhost:8000 before the server is actually ready. Connection fails. Page reloads. Tries again. Same thing. Infinite loop.&lt;/p&gt;

&lt;p&gt;The sequence of events looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser loads page
 → SmartSocket.js executes immediately
 → Attempts WebSocket connection to localhost:8000
 → Server not ready yet
 → Connection refused
 → Page session cleared
 → Page reloads
 → Same thing happens again
 → ...forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It perfectly mimicked tamper detection behavior, which is why I assumed that’s what it was. But it was just a timing issue.&lt;/p&gt;

&lt;p&gt;The fix is simple: wait for the server to be ready before launching the browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browser_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_connection&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Now launch browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this in place, the correct sequence is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start App
 → Start WebSocket server
 → Poll until server is accepting connections
 → Launch browser
 → WebSocket connects successfully
 → Authentication succeeds
 → Exam proceeds normally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
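
&lt;p&gt;The “poll until the server is accepting connections” step can also be pulled out into a standalone helper that reports whether the server ever came up. This is a sketch, not SmartLock's code: wait_for_port and its parameters are names chosen for illustration, with defaults mirroring the 50 attempts and 0.1 second delay used in the fix.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket
import time


def wait_for_port(host, port, attempts=50, delay=0.1):
    """Return True once a TCP connection to (host, port) succeeds, False if it never does."""
    for _ in range(attempts):
        try:
            # The connection is closed again as soon as the with-block exits
            with socket.create_connection((host, port), timeout=delay):
                return True
        except OSError:
            time.sleep(delay)
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning a boolean lets the caller decide what to do when the server never becomes reachable, for example aborting startup instead of launching the browser straight back into the reload loop.&lt;/p&gt;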



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj14za7htls7g21ap02a2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj14za7htls7g21ap02a2.gif" width="600" height="338"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Before (infinite reload loop) vs. after (successful startup)&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 7: What the App Is Actually Doing Under the Hood
&lt;/h3&gt;

&lt;p&gt;Once the startup problem was solved, I could look more carefully at all the enforcement mechanisms running in the background. There’s quite a lot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OS-Level Lockdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The registry_edit module modifies the Windows registry to disable the usual escape routes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;DisableTaskMgr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;DisableLockWorkstation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;NoLogoff&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This disables Task Manager, workstation locking, and the ability to log off. The Ctrl+Alt+Del menu effectively becomes useless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The monitoring engine maintains a list of software that shouldn’t be running during an exam. Screen recording and virtual camera tools are specifically targeted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OBS Studio&lt;/li&gt;
&lt;li&gt;ManyCam&lt;/li&gt;
&lt;li&gt;XSplit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these processes are detected, a violation is flagged.&lt;/p&gt;
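
&lt;p&gt;A check like this can be sketched as a simple set intersection. The executable names below are illustrative guesses, not values recovered from SmartLock, and find_blocked is a hypothetical helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical blocklist: these executable names are illustrative guesses,
# not values recovered from SmartLock.
BLOCKED_PROCESS_NAMES = {"obs64.exe", "manycam.exe", "xsplit.core.exe"}


def find_blocked(running_names, blocked=BLOCKED_PROCESS_NAMES):
    """Return the sorted, lowercased names from `running_names` that are on the blocklist."""
    running = {name.lower() for name in running_names}
    return sorted(running.intersection(blocked))


print(find_blocked(["chrome.exe", "OBS64.exe"]))  # ['obs64.exe']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since psutil is bundled with the app, the running names plausibly come from something like psutil.process_iter(["name"]), but that part is an assumption on my side.&lt;/p&gt;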

&lt;p&gt;&lt;strong&gt;USB Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The app takes a snapshot of connected USB devices at startup and watches for changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_usb&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;initial_usb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;flag_violation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plugging in a USB drive during the exam is treated as a potential integrity violation.&lt;/p&gt;
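
&lt;p&gt;To make the snapshot comparison concrete, here is a small sketch that also reports what changed. The function names and the device ID strings are made up for the example; how SmartLock actually enumerates devices is not shown here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def usb_changes(initial, current):
    """Compare two snapshots of device IDs and report what appeared or disappeared."""
    initial, current = set(initial), set(current)
    return {
        "added": sorted(current - initial),
        "removed": sorted(initial - current),
    }


def is_violation(initial, current):
    """Mirror the check above: any difference from the startup snapshot counts."""
    changes = usb_changes(initial, current)
    return bool(changes["added"] or changes["removed"])


startup = ["keyboard", "mouse"]  # placeholder device IDs
print(usb_changes(startup, ["keyboard", "mouse", "flash-drive"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;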

&lt;p&gt;&lt;strong&gt;Multi-Monitor and VM Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multiple displays are blocked. The app also checks whether it’s running inside a virtual machine — which would make it easier to manipulate the environment without being detected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Filtering via pydivert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most interesting piece. The app uses pydivert - a Python wrapper around WinDivert - to implement packet-level network filtering. During an exam, only certain destinations are allowed. Everything else is dropped at the kernel level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 8: Two WebSockets, Not One
&lt;/h3&gt;

&lt;p&gt;I initially assumed there was a single WebSocket connection: browser to local app. There are actually two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local WebSocket (localhost:8000): Browser ↔ Desktop App&lt;/li&gt;
&lt;li&gt;Remote WebSocket: Desktop App ↔ Remote Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The local one handles session management for the browser. The remote one is for continuous telemetry — the app regularly sends status updates to the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProcessCheck"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"detected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server isn’t passive. It’s continuously validating that the client is behaving correctly. If the telemetry stops or reports a violation, the server can terminate the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All the connection details live in a YAML config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;ExamIp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;164.100.69.5&lt;/span&gt;
&lt;span class="py"&gt;ExamUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathi.parivahan.gov.in/sarathiservice/authenticationaction.do?authtype=Anugyna&lt;/span&gt;
&lt;span class="py"&gt;SocketPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8000&lt;/span&gt;
&lt;span class="py"&gt;SocketServerPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3000&lt;/span&gt;
&lt;span class="py"&gt;SocketUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ws://sarathi.parivahan.gov.in&lt;/span&gt;
&lt;span class="py"&gt;StatusApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathi.parivahan.gov.in/sarathiWS/rsServices/smartLockCheck/smartLockCheck&lt;/span&gt;
&lt;span class="py"&gt;ViolationApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathicov.nic.in:8443/sarathiWS/rsServices/smartLockCheck/examViolation&lt;/span&gt;
&lt;span class="py"&gt;primaryServerIPV4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.172.31.33&lt;/span&gt;
&lt;span class="py"&gt;primaryServerIPV6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2001:4408:7204:8:5d93:8239:8876:d238&lt;/span&gt;
&lt;span class="py"&gt;secondaryServerIPV4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.172.31.30&lt;/span&gt;
&lt;span class="py"&gt;secondaryServerIPV6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2001:4408:7204:9::aac:2033&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the behavior is somewhat server-controlled. The exam URL, the socket address, the session logic — it’s all configured externally, which makes the server the real authority over how the session runs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 9: The Firewall
&lt;/h3&gt;

&lt;p&gt;One more thing worth mentioning: the app appears to interact with the Windows Firewall directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pbox_fw_backup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pbox_bkp.wfw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This exports the current firewall rules to a backup file before modifying them. As far as I can tell, the app replaces your firewall rules with its own restricted ruleset for the duration of the exam, then restores the backup afterward. It may also hash the backup to detect whether someone has tampered with the rules mid-session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 10: Why You Can’t Just Rebuild It
&lt;/h3&gt;

&lt;p&gt;One last thing I want to address directly: extracting the PyInstaller binary does not give you a working copy of the application. There’s a common misconception that extraction equals reconstruction. It doesn’t.&lt;/p&gt;

&lt;p&gt;The workflow for actually rebuilding would be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decompile .pyc files to .py&lt;/li&gt;
&lt;li&gt;Manually correct the decompilation errors&lt;/li&gt;
&lt;li&gt;Rebuild with PyInstaller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1 and 2 are where it falls apart in practice. The decompiled code has inaccuracies that aren’t always obvious. Some of the control flow is wrong in subtle ways that only become apparent at runtime. There are also timing dependencies baked into the threading model, and the server-side validation means you’d need a cooperating server to test anything properly.&lt;/p&gt;

&lt;p&gt;The security here doesn’t come from any single mechanism being unbreakable. It comes from the combination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The local app monitors the OS&lt;/li&gt;
&lt;li&gt;The browser depends on the local app&lt;/li&gt;
&lt;li&gt;The server monitors the local app&lt;/li&gt;
&lt;li&gt;The network is filtered at the kernel level&lt;/li&gt;
&lt;li&gt;The firewall is replaced during the session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer on its own is probably defeatable. Together, they create a system where defeating one layer doesn’t help much because the others remain intact.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Took Away From This
&lt;/h3&gt;

&lt;p&gt;A few things stuck with me after this investigation.&lt;/p&gt;

&lt;p&gt;The infinite reload loop was genuinely convincing as tamper detection. I spent more time than I’d like to admit looking for a security mechanism that wasn’t there. The lesson is that emergent behavior from a race condition can look exactly like intentional defensive behavior. Don’t assume intent before you’ve traced the actual execution path.&lt;/p&gt;

&lt;p&gt;The browser is doing real security work here, not just displaying a UI. SmartSocket.js is the link between the exam session and the local enforcement system. If that connection breaks, the exam can’t proceed. That’s a deliberate architectural choice, not an accident.&lt;/p&gt;

&lt;p&gt;And PyInstaller extraction, while possible, is just the beginning. The hard part isn’t getting the bytecode out. It’s making sense of what the decompiler gives you and reconstructing what the developer originally meant to write.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Could Come Next
&lt;/h3&gt;

&lt;p&gt;If I were going to continue this investigation, the natural directions would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping the full WebSocket protocol between browser and app, and between app and server&lt;/li&gt;
&lt;li&gt;Tracing the telemetry payloads to understand exactly what data gets sent and when&lt;/li&gt;
&lt;li&gt;Building a sequence diagram for the full session lifecycle, from startup through exam completion&lt;/li&gt;
&lt;li&gt;Looking more carefully at the firewall manipulation and how (or whether) it detects tampering with the backed-up rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that fits in one article. But the architecture is now clear enough that any of those threads could be pulled independently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz0dbz8bvw1ackgxjbh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz0dbz8bvw1ackgxjbh3.png" alt="Screenshot: Final — application running normally with exam loaded, WebSocket connection established" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Final — application running normally with exam loaded, WebSocket connection established&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is for educational purposes — understanding how production security systems are architected and why they’re difficult to tamper with. The focus throughout has been on the design and behavior of the system, not on defeating it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 07, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>websocket</category>
      <category>python</category>
      <category>reverseengineering</category>
    </item>
  </channel>
</rss>
