<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuvraj Raghuvanshi</title>
    <description>The latest articles on DEV Community by Yuvraj Raghuvanshi (@yuvrajraghuvanshis).</description>
    <link>https://dev.to/yuvrajraghuvanshis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3905869%2F3bf3b41f-3acf-43f7-8e09-103390db5ac8.png</url>
      <title>DEV Community: Yuvraj Raghuvanshi</title>
      <link>https://dev.to/yuvrajraghuvanshis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yuvrajraghuvanshis"/>
    <language>en</language>
    <item>
      <title>The Website That Looked Like It Needed Selenium (But Didn’t)</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Thu, 30 Apr 2026 10:33:32 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/the-website-that-looked-like-it-needed-selenium-but-didnt-1p1</link>
      <guid>https://dev.to/yuvrajraghuvanshis/the-website-that-looked-like-it-needed-selenium-but-didnt-1p1</guid>
      <description>&lt;p&gt;For my thesis I needed a large corpus of Hindi poetry. &lt;a href="https://hindwi.org" rel="noopener noreferrer"&gt;Hindwi&lt;/a&gt; is one of the better maintained Hindi literature archives on the internet. Thousands of poems, hundreds of poets, content spanning from the 8th century to contemporary writers. It had everything I needed.&lt;/p&gt;

&lt;p&gt;I didn’t plan to spend much time on the scraper. Collect the data, move on.&lt;/p&gt;

&lt;p&gt;That didn’t happen.&lt;/p&gt;

&lt;h3&gt;The Obvious Problem&lt;/h3&gt;

&lt;p&gt;Visit &lt;a href="https://hindwi.org/poets" rel="noopener noreferrer"&gt;hindwi.org/poets&lt;/a&gt; and you’ll see a listing of poets. Scroll down and more appear. Visit an individual poet’s page and the same thing happens — poems load as you scroll. This is the pattern that makes every scraper writer reach for Selenium almost reflexively. The content isn’t in the initial HTML. JavaScript is loading it dynamically. You need a browser.&lt;/p&gt;

&lt;p&gt;So I set up Selenium. Headless Chrome, scroll simulation, wait for elements to appear, extract content. It worked. It was also agonizingly slow.&lt;/p&gt;

&lt;p&gt;The real problem wasn’t just speed — it was that Selenium is fundamentally impractical to parallelize. You can’t easily spin up ten browser instances and scrape ten poets simultaneously the way you can with threads making HTTP requests. Each browser instance carries its own rendering engine, memory space, and JavaScript runtime. The resource cost compounds quickly, and the coordination between instances is a nightmare. Even with aggressive parallelism, back-of-envelope math on 25,000+ poems made it clear this would take days, not hours.&lt;/p&gt;

&lt;p&gt;There had to be a better way.&lt;/p&gt;

&lt;h3&gt;Ten Minutes in DevTools&lt;/h3&gt;

&lt;p&gt;Before writing any more Selenium code, I opened the browser DevTools Network tab and watched what actually happened when the page loaded more content.&lt;/p&gt;

&lt;p&gt;This is always worth doing before committing to browser automation. Dynamic-looking behavior on the frontend is still, at the network level, just HTTP requests. The browser has to get the data from somewhere. The question is whether that somewhere is directly reachable.&lt;/p&gt;

&lt;p&gt;On Hindwi, when you scroll to the bottom of the poets listing, the browser fires a request like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://www.hindwi.org/PoetCollection?lang=2&amp;amp;pageNumber=2&amp;amp;Info=poet
&amp;amp;StartsWith=&amp;amp;keyword=&amp;amp;typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
&amp;amp;TypeSlug=poets&amp;amp;contentFilter=&amp;amp;_=1777462454692
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plain GET request. No authentication tokens in the body, no encrypted signatures, no WebSocket handshake. Just query parameters. The &lt;code&gt;_=1777462454692&lt;/code&gt; at the end is a cache-busting timestamp the browser adds automatically; the server doesn't validate it, so scrapers can ignore it entirely.&lt;/p&gt;
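&lt;p&gt;To replicate that request outside the browser, you can rebuild the query string from the same parameters and simply drop the cache-buster. A minimal sketch, with parameter values copied from the captured request above:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Parameters seen in the captured DevTools request; the trailing
# "_" cache-buster is deliberately left out.
params = {
    "lang": 2,
    "pageNumber": 2,
    "Info": "poet",
    "StartsWith": "",
    "keyword": "",
    "typeID": "659186cb-44e7-4d94-8b1a-fc70f939a733",
    "TypeSlug": "poets",
    "contentFilter": "",
}
url = "https://www.hindwi.org/PoetCollection?" + urlencode(params)
print(url)
```

&lt;p&gt;The resulting URL can then be fetched with any plain HTTP client, no browser required.&lt;/p&gt;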

&lt;p&gt;The response that came back was raw HTML — not JSON, not XML. Just HTML cards containing poet names, dates, and profile links, ready to be injected into the DOM. So the website wasn’t serving a proper API, but it was serving something structured, paginated, and directly reachable over plain HTTP.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vg5lsuq7f8x7a9jnxv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vg5lsuq7f8x7a9jnxv5.png" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: DevTools Network tab showing the /PoetCollection request and its HTML response body&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next question was: how does the browser know what URL to request for page 3, page 4, page 5? Where does that information come from?&lt;/p&gt;
&lt;h3&gt;The URL Was Sitting Right There&lt;/h3&gt;

&lt;p&gt;I looked at the page source. And there they were — all of them, already embedded in the initial HTML response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMore"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=3&amp;amp;Info=poet
                  &amp;amp;StartsWith=&amp;amp;keyword=&amp;amp;typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
                  &amp;amp;TypeSlug=poets&amp;amp;contentFilter="&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;svg&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"screenLoader"&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/svg&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The site pre-embeds the URLs for every subsequent page inside data-url attributes on div.contentLoadMorePaging elements. The JavaScript reads these attributes and fires the requests when you scroll into view. But from a scraper's perspective, the URLs are already there in the first response; you don't need to scroll anything. You just parse them out and fetch them directly.&lt;/p&gt;

&lt;p&gt;This was the moment Selenium became irrelevant.&lt;/p&gt;

&lt;p&gt;What looked like dynamic JavaScript-driven content was really just a simple pattern: fetch the initial page, extract the hidden data-url values, make those HTTP requests directly. No browser. No scroll simulation. No waiting for DOM mutations.&lt;/p&gt;
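&lt;p&gt;As a self-contained illustration, here is a toy version of that markup built programmatically with BeautifulSoup (so the sketch carries no hand-written HTML, and the single-parameter data-url is a simplified stand-in), queried with the same kind of attribute selector a scraper would use:&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# Rebuild a miniature version of the pagination markup programmatically.
soup = BeautifulSoup("", "html.parser")
pager = soup.new_tag(
    "div",
    attrs={"class": "contentLoadMorePaging", "data-url": "/PoetCollection?pageNumber=3"},
)
soup.append(pager)

# One CSS attribute selector pulls out every pre-embedded page URL:
urls = [div["data-url"] for div in soup.select("div.contentLoadMorePaging[data-url]")]
print(urls)  # ['/PoetCollection?pageNumber=3']
```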

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ve7ifmf5sak1r2ig2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ve7ifmf5sak1r2ig2g.png" width="800" height="208"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: page source with the contentLoadMorePaging div and data-url attribute visible&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;The Same Pattern, Everywhere&lt;/h3&gt;

&lt;p&gt;Once I knew what to look for, I checked the individual poet pages. Same pattern. A poet with more than 50 poems (Mona Gulati, for example) has this in her initial page response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMore"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=2&amp;amp;info=ghazals
                  &amp;amp;SEO_Slug=kavita&amp;amp;Id=34074990-5be7-43e9-8a85-6aaa0be4833c
                  &amp;amp;Info=ghazal&amp;amp;StartsWith=a&amp;amp;typeID=659186cb-...
                  &amp;amp;contentType=kavita&amp;amp;sort=popularity-desc&amp;amp;filter="&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"contentLoadMorePaging"&lt;/span&gt; 
         &lt;span class="na"&gt;data-url=&lt;/span&gt;&lt;span class="s"&gt;"/PoetCollection?lang=2&amp;amp;pageNumber=3..."&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both page 2 and page 3 are listed upfront in the first response. The site hands you the complete roadmap immediately. Fetch once, and you know exactly what to fetch next — no interaction, no scrolling, no waiting.&lt;/p&gt;

&lt;p&gt;This held for dohas, quotes, and every other content type on the site. The contentLoadMorePaging pattern was consistent across all of Hindwi. Understanding it once meant the whole site was open.&lt;/p&gt;

&lt;h3&gt;Turning the Insight Into Code&lt;/h3&gt;

&lt;p&gt;The scraper that came out of this is conceptually simple. For the poet listing, hit the /PoetCollection endpoint and keep incrementing pageNumber until you get an empty response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_paginated_poet_cards&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lang&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pageNumber&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Info&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extra_params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;POETS_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.poetColumn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cards&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;cards&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For poem lists, fetch the poet’s kavita page, parse whatever poems are already in the initial HTML, then extract and follow every data-url:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_poem_metadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kavita_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kavita_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;poems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_poem_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pagination_divs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.contentLoadMorePaging[data-url]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;seen_urls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;div&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pagination_divs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;seen_urls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;full_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;urljoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.hindwi.org&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;paginated_soup&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_soup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;full_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;poems&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_parse_poem_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paginated_soup&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;poems&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No browser. No scroll events. Two BeautifulSoup calls per paginated poet.&lt;/p&gt;

&lt;p&gt;One thing worth mentioning about _parse_poem_list: the initial page and the dynamically loaded fragment pages use different CSS classes for their poem cards. The initial listing uses div.rt_contentBodyListItems, while the paginated HTML fragments come back using div.contentListItems.nwPoetListBody. I caught this when certain poets were returning suspiciously fewer poems than their profile pages suggested: the paginated content was being silently skipped because the selector only matched the first class. A multi-selector handles both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cards&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is exactly the kind of thing that produces wrong results silently. No error, no exception — just a poem count that’s quietly lower than it should be.&lt;/p&gt;
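&lt;p&gt;A quick way to see the failure mode: build one card of each flavour programmatically and compare what a narrow selector and the multi-selector return (class names are the ones described above; the empty divs are toy stand-ins for real cards):&lt;/p&gt;

```python
from bs4 import BeautifulSoup

# One card per layout: the initial-page class and the fragment classes.
soup = BeautifulSoup("", "html.parser")
soup.append(soup.new_tag("div", attrs={"class": "rt_contentBodyListItems"}))
soup.append(soup.new_tag("div", attrs={"class": "contentListItems nwPoetListBody"}))

narrow = soup.select("div.rt_contentBodyListItems")
broad = soup.select("div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody")
print(len(narrow), len(broad))  # the narrow selector silently misses a card
```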

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbs8krjxj9dlnkfgk23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Furbs8krjxj9dlnkfgk23.png" width="705" height="424"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: terminal output showing a poet being processed with their correct poem count&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Extracting the Poems&lt;/h3&gt;

&lt;p&gt;Each poem lives on its own URL. The page serves the text in Devanagari and, for many poems, a Romanized transliteration toggled by a button. In the HTML, both versions are already present — just hidden or shown depending on which toggle is active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Devanagari
&lt;/span&gt;&lt;span class="n"&gt;hindi_div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pMC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Romanized
&lt;/span&gt;&lt;span class="n"&gt;roman_div&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;soup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HindwiRoman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;roman_pmc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;roman_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pMC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-roman&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The text itself is structured as &amp;lt;p&amp;gt; tags containing &amp;lt;span&amp;gt; tags per word or phrase. Joining the spans within each paragraph gives one line:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;hindi_div&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;span&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;hindi_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Both versions get saved as separate plain text files. Not every poem has a Romanized version, so the code returns None for the roman field when it doesn't exist rather than an empty list, preserving the distinction between "no Roman version" and "Roman version is blank."&lt;/p&gt;
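&lt;p&gt;That convention is easy to get wrong, so here is a tiny hypothetical helper (not the scraper's actual function, just an illustration of the rule) that makes it explicit:&lt;/p&gt;

```python
def roman_lines_or_none(roman_div, lines):
    # None means the page had no Roman toggle at all;
    # an empty list means the toggle existed but yielded no text.
    if roman_div is None:
        return None
    return lines

print(roman_lines_or_none(None, []))           # None  (no Roman version)
print(roman_lines_or_none("div-present", []))  # []    (Roman version is blank)
```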

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8gznijxuhvda31pam53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8gznijxuhvda31pam53.png" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: a poem page on Hindwi showing the Devanagari text alongside the Roman toggle&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;Concurrency — The Real Payoff&lt;/h3&gt;

&lt;p&gt;With Selenium out of the picture, threading became trivial. The poem scraper processes all poets concurrently with a thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_poems&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;futures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_process_poet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poets&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;future&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;as_completed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;future&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ten threads making lightweight HTTP requests is nothing. This is what was completely impractical with Selenium — ten browser instances would have needed a dedicated server to run without thrashing. Ten request threads ran fine on a laptop, barely registering on the CPU.&lt;/p&gt;

&lt;p&gt;Every request goes through a shared get_soup wrapper that enforces a 1-second politeness delay and retries with exponential backoff on failures. Errors at any level (a single poem, an entire poet) get logged and skipped rather than crashing the thread. The run completed cleanly in about two hours. A small number of URLs consistently returned server errors and landed in the log; everything else went through without issue.&lt;/p&gt;
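&lt;p&gt;The wrapper itself isn't shown here, but a sketch of the described behaviour (politeness delay, exponential backoff) might look like the following. The fetch parameter is an extra knob added so the function can be exercised without touching the network; the real wrapper presumably calls requests directly:&lt;/p&gt;

```python
import time

import requests
from bs4 import BeautifulSoup

def get_soup(url, params=None, retries=3, delay=1.0, fetch=requests.get):
    """Sketch of a shared wrapper: politeness delay before every attempt,
    exponential backoff between retries, parsed soup on success."""
    last_error = None
    for attempt in range(retries):
        time.sleep(delay * (2 ** attempt))  # 1s, then 2s, then 4s
        try:
            resp = fetch(url, params=params, timeout=30)
            resp.raise_for_status()
            return BeautifulSoup(resp.text, "html.parser")
        except requests.RequestException as exc:
            last_error = exc
    raise last_error
```

&lt;p&gt;Because the delay and the fetcher are parameters, the same function works unchanged whether it is called from one thread or ten.&lt;/p&gt;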

&lt;h3&gt;The Result&lt;/h3&gt;

&lt;p&gt;Two hours. 25,000+ poems across hundreds of poets. Devanagari and Romanized versions where available. Structured metadata including titles, URLs, slugs, and categories per poem. Around 300MB of text in total.&lt;/p&gt;

&lt;p&gt;The dependency list tells the whole story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;beautifulsoup4==4.13.4
requests==2.32.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Selenium, no browser drivers, no Playwright, no headless Chrome. Just HTTP requests and HTML parsing.&lt;/p&gt;

&lt;h3&gt;What I Took From This&lt;/h3&gt;

&lt;p&gt;The instinct to reach for Selenium when you see dynamic content is understandable — it’s the safe default that definitely works. But dynamic content loading just means the browser is making HTTP requests after the initial page load. Those requests go somewhere, return something, and in most cases can be replicated directly.&lt;/p&gt;

&lt;p&gt;The contentLoadMorePaging pattern on Hindwi is a good illustration of how often websites like this are more accessible than they appear. The site wasn't hiding anything. It was handing out pagination URLs in plain HTML, sitting in data-url attributes, ready to be read. JavaScript just happened to be the first thing reading them, until a scraper came along.&lt;/p&gt;

&lt;p&gt;Ten minutes in the Network tab before writing any scraping code is almost always worth it. In this case, it was the difference between days of Selenium pain and a two-hour requests script that finished before lunch.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is for educational purposes — all ethical considerations have been addressed, including measures such as rate limiting and conducting scraping during periods of low website traffic.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 30, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>reverseengineering</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Training a Classifier on Huge Dataset When RAM Is Not Your Friend</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:05:03 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/training-a-classifier-on-huge-dataset-when-ram-is-not-your-friend-kle</link>
      <guid>https://dev.to/yuvrajraghuvanshis/training-a-classifier-on-huge-dataset-when-ram-is-not-your-friend-kle</guid>
      <description>&lt;p&gt;I didn’t set out to build a custom data loader. I set out to train a model on the Quick, Draw! dataset.&lt;/p&gt;

&lt;p&gt;The data pipeline was supposed to be the boring part — the few lines you write before the interesting work starts. It ended up being most of the work, the source of the most frustrating bugs, and, in retrospect, the most interesting engineering decision of the whole project.&lt;/p&gt;

&lt;p&gt;This is the story of why I ended up with a directory containing millions of individual .npy files, and why that turned out to be the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Quick, Draw! Actually Is
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/googlecreativelab/quickdraw-dataset" rel="noopener noreferrer"&gt;Quick, Draw!&lt;/a&gt; is a Google dataset of human drawings collected from a browser game where players had 20 seconds to draw a prompted word. It has 345 categories — cats, airplanes, zigzags, The Eiffel Tower — with up to 100,000 drawings per class. That’s about 50 million drawings in total.&lt;/p&gt;

&lt;p&gt;What makes it interesting for ML, and annoying for data pipelines, is that each drawing has two representations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raster images&lt;/strong&gt;  — each drawing rendered as a 28×28 grayscale bitmap, stored as a flat array of 784 values. These come in .npy files where a single file for one class contains an array of shape (N, 784). For 100,000 samples, that's 100,000 rows of 784 values per file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stroke sequences&lt;/strong&gt;  — the original drawing data: a sequence of (dx, dy, pen_state) triplets representing how the pen moved. These come in .npz files, split into train, val, and test keys. The stroke data varies in length per drawing — a simple zigzag might have 10 points, a detailed drawing of The Great Wall of China might have hundreds.&lt;/p&gt;

&lt;p&gt;The model I wanted to build was multimodal: it would take both representations as input simultaneously, letting a CNN process the image and an LSTM process the stroke sequence, then merge their outputs for classification. Which meant the pipeline had to serve both modalities in sync, for every sample, across 345 classes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheuo2b4r54xvhjx3d41s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fheuo2b4r54xvhjx3d41s.png" alt="Screenshot: a sample drawings from Quick, Draw! — both the raster image and stroke visualization side by side" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: a sample drawings from Quick, Draw! — both the raster image and stroke visualization side by side&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Naive Approach and Why It Dies
&lt;/h3&gt;

&lt;p&gt;The obvious first attempt is the one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# shape: (~100000, 784)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That loads fine for one class. You run it for a few classes, you’re still fine. Then somewhere around class 20 or 30 your process gets killed by the OOM killer, or your Jupyter kernel crashes silently, or the remote server you’ve SSH’d into drops your connection and takes your training run with it.&lt;/p&gt;

&lt;p&gt;With 345 classes at 30,000 samples each (my chosen limit) — we’re talking about loading roughly 10 million samples into RAM at startup. At around 11% of a 128GB server’s memory for 10,000 samples per class, the math on 30,000 samples gets uncomfortable fast. And that’s before you account for the stroke data.&lt;/p&gt;
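&lt;p&gt;A back-of-envelope check makes the problem concrete. Counting only the raster images as float32, and ignoring the stroke data and Python object overhead entirely:&lt;br&gt;
&lt;/p&gt;

```python
classes = 345
samples_per_class = 30_000
floats_per_image = 784   # 28 x 28, flattened
bytes_per_float32 = 4

total_bytes = classes * samples_per_class * floats_per_image * bytes_per_float32
print(total_bytes / 1024 ** 3)  # roughly 30 GiB, before any stroke data
```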

&lt;p&gt;The real problem isn’t just peak RAM usage. It’s that loading everything upfront means you can’t start training until loading finishes, the loaded arrays stay resident for the entire run, and any shuffle operation has to work over the full dataset in memory. All of this compounds.&lt;/p&gt;

&lt;p&gt;There’s also a subtlety with the stroke files: they come pre-split into train/val/test partitions. If you want to do your own splits (which you do, so you can control the ratio and the random seed), you need to recombine them first and re-split yourself.&lt;/p&gt;

&lt;p&gt;So before we get to the loader itself, there are three preprocessing steps to run.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Downloading the Data
&lt;/h3&gt;

&lt;p&gt;The download script fetches both file types from Google’s Cloud Storage. The listing endpoint returns XML, which the script parses to find the URLs for the classes you’ve defined in base_classes. Downloads run in parallel using a thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;concurrent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;futures&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ThreadPoolExecutor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;download_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download_folder&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;file_urls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two separate calls — one for .npy raster files, one for .npz stroke files, filtered to the sketchrnn/ prefix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;download_quickdraw_files&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;file_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;download_quickdraw_files&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;file_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;npz&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix_filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sketchrnn/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Files that already exist are skipped, which matters when you’re running this on a remote server where connections drop and you have to restart.&lt;/p&gt;
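&lt;p&gt;The skip logic is simple, but it's what makes restarts cheap. A sketch of what download_file might look like; the exact signature and filename handling in the real script may differ:&lt;br&gt;
&lt;/p&gt;

```python
import os
import requests

def download_file(url, folder):
    # Resume-friendly: if the file is already on disk, don't fetch it again.
    dest = os.path.join(folder, url.rsplit("/", 1)[-1])
    if os.path.exists(dest):
        print("[SKIP]", dest)
        return dest
    resp = requests.get(url, timeout=60)
    resp.raise_for_status()
    with open(dest, "wb") as f:
        f.write(resp.content)
    print("[DOWNLOAD]", dest)
    return dest
```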

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gyiac4oc4mno5t0gbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F43gyiac4oc4mno5t0gbg.png" alt="Screenshot: terminal output during download — the [DOWNLOAD] and [SKIP] lines showing parallel fetching" width="773" height="260"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: terminal output during download — the [DOWNLOAD] and [SKIP] lines showing parallel fetching&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Recombining the Stroke Splits
&lt;/h3&gt;

&lt;p&gt;Each stroke .npz file has three keys: train, val, and test. Left as-is, you're working with a subset of the available data. The fix is to concatenate them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;val&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;savez_compressed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs in parallel across all classes using ProcessPoolExecutor. One thing worth noting: after each class is combined, gc.collect() is called explicitly. In a multiprocessing context, worker processes don't always release memory as promptly as you'd expect. Without the explicit collection, a machine with moderate RAM will start sweating as dozens of processes hold combined arrays simultaneously.&lt;/p&gt;
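&lt;p&gt;A sketch of what the worker function might look like under those assumptions (the paths and tuple argument are illustrative, and the real script may differ in details):&lt;br&gt;
&lt;/p&gt;

```python
import gc
import numpy as np

def combine_class(paths):
    # Runs inside a ProcessPoolExecutor worker: merge the three splits,
    # write one compressed file, then free the arrays explicitly.
    in_path, out_path = paths
    data = np.load(in_path, allow_pickle=True, encoding="latin1")
    combined = np.concatenate([data["train"], data["val"], data["test"]], axis=0)
    np.savez_compressed(out_path, strokes=combined)
    del data, combined
    gc.collect()  # don't wait for the worker's garbage collector to get around to it
    return out_path
```

&lt;p&gt;Each class becomes one (in_path, out_path) job, submitted across the pool with something like ex.map(combine_class, jobs).&lt;/p&gt;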

&lt;h3&gt;
  
  
  Step 3: The Key Idea — One File Per Sample
&lt;/h3&gt;

&lt;p&gt;This is the decision everything else depends on.&lt;/p&gt;

&lt;p&gt;Instead of keeping each class as a single large .npy file, we explode every sample out into its own file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dataset/processed/
  images/
    cat/
      000001.npy ← shape: (28, 28, 1)
      000002.npy
      ...
  strokes/
    cat/
      000001.npy ← shape: (130, 3)
      000002.npy
      ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The conversion script loops over every class, loads the class-level arrays, preprocesses each sample, and saves them individually. The index is global across all classes — not per-class — which is what keeps image and stroke files aligned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;max_samples_per_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100_000&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;LABEL_MAP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mmap_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# note: memory-mapped
&lt;/span&gt;    &lt;span class="n"&gt;strokes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;stroke_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;allow_pickle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latin1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
              &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strokes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;max_samples_per_class&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nf"&gt;preprocess_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;strokes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;label_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;06&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nf"&gt;preprocess_strokes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;global_idx&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image loading uses mmap_mode="r" — memory-mapped, so NumPy doesn't load the entire (100000, 784) array into RAM just to iterate over it row by row. The preprocessing happens at this stage, not at training time, so the generator later is just doing file reads.&lt;/p&gt;

&lt;p&gt;This step takes a while to run. On the upside, it runs once.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoeh8buwpzssev3fuhkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoeh8buwpzssev3fuhkk.png" alt="Screenshot: the processed/ directory structure — showing the per-class subdirectories with numbered .npy files" width="379" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: the processed/ directory structure — showing the per-class subdirectories with numbered .npy files&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  What Preprocessing Actually Does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Images&lt;/strong&gt; are straightforward. Reshape (784,) to (28, 28), divide by 255 to get [0, 1] floats, expand the channel dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;img&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;flat_img&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;255.0&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand_dims&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# (28, 28, 1)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strokes&lt;/strong&gt; are more involved. The raw data uses relative coordinates — each (dx, dy) is an offset from the previous point, not an absolute position. This makes sense for how drawings are recorded but not for how a model should see them. The preprocessing converts to absolute, centers the drawing at the origin, then scales to a fixed [-100, 100] range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Relative -&amp;gt; absolute
&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cumsum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Center at origin
&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Scale to [-100, 100]
&lt;/span&gt;&lt;span class="n"&gt;max_coord&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt;
    &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*=&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_coord&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stroke sequences are variable length. To get a fixed-size tensor for the LSTM, sequences are either truncated or zero-padded to 130 points. Why 130? Empirically, that covers the vast majority of drawings in the dataset without wasting too many zeros on the short ones.&lt;/p&gt;

&lt;p&gt;The pen state column (the third feature) is left as-is — it’s already a binary indicator of whether the pen is lifted.&lt;/p&gt;
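&lt;p&gt;The padding step itself is only a few lines. A sketch, assuming a (N, 3) float array in and the fixed length of 130 described above (the helper name is hypothetical):&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

MAX_SEQ_LEN = 130  # covers the vast majority of drawings without excess padding

def pad_or_truncate(strokes, max_len=MAX_SEQ_LEN):
    # Zero-pad short sequences, cut long ones; always return shape (max_len, 3).
    out = np.zeros((max_len, 3), dtype=np.float32)
    n = min(len(strokes), max_len)
    out[:n] = strokes[:n]
    return out
```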

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzppujve678gc3qvsd81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzppujve678gc3qvsd81.png" width="800" height="341"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: before/after visualization of a stroke — raw relative coordinates as a mess of lines, then the centered/normalized version looking like the actual drawing&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Loader
&lt;/h3&gt;

&lt;p&gt;After preprocessing, the index step is fast. We walk the processed directory and collect all file paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;LABEL_MAP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;image_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROCESSED_DATA_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/images/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;stroke_files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;glob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROCESSED_DATA_DIR&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/strokes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/*.npy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stroke_files&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;SAMPLES_PER_CLASS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_files&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stroke_files&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, images and strokes are just lists of strings. Nothing has been loaded into memory. The total dataset — 345 classes × 30,000 samples — indexes in a few seconds.&lt;/p&gt;

&lt;p&gt;There’s also a threshold in the config: IN_MEMORY_THRESHOLD = 30_000. If SAMPLES_PER_CLASS is at or below that threshold, the loader will actually call np.load() during indexing and store the arrays directly. For quick experiments on a subset of data, this avoids the per-sample I/O overhead at training time. For large runs, it streams from disk instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;USE_INDIVIDUAL&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SAMPLES_PER_CLASS&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;IN_MEMORY_THRESHOLD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both paths feed into the same generator interface, which is a nice property — you can switch between them by changing one number.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Generator and the tf.data Pipeline
&lt;/h3&gt;

&lt;p&gt;The generator is a Python function that yields (image, stroke, one_hot_label) tuples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;USE_IN_MEMORY&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;USE_INDIVIDUAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;str_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;yield &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                    &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; 
                    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;one_hot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feeds into a tf.data.Dataset via from_generator, which requires explicit output signatures - TensorFlow needs to know shapes and dtypes upfront since it can't infer them from a Python generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;130&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TensorSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NUM_CLASSES&lt;/span&gt;&lt;span class="p"&gt;,),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full pipeline adds shuffling (shuffles a buffer of 10× the batch size rather than the entire dataset), repeating, batching at 512, and prefetching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_shuffle&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_signature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_shuffle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format_sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_parallel_calls&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AUTOTUNE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format_sample step reformats the yielded tuple into the dictionary format Keras expects for multi-input models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;format_sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stroke_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stroke&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shuffling indices, not files, is important here. The file layout on disk stays sequential — images for cat are in one directory, images for airplane in another. The shuffle happens in the data pipeline as it reads, which avoids random I/O seeks across the disk. Sequential reads are substantially faster than random ones, and the OS page cache will warm up the recently accessed files naturally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzbhigzyiwseo9odc0vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzbhigzyiwseo9odc0vn.png" width="800" height="292"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: htop showing RAM usage during training — relatively flat, not growing with training time&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Splitting the Dataset
&lt;/h3&gt;

&lt;p&gt;The split is index-based. We shuffle a global index array once with a fixed seed, then slice it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shuffle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;train_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;val_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;train_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;train_end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;val_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;train_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;val_end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;test_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val_end&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 80/10/10 ratio applies across all classes since the indexing step already interleaved everything. There’s no risk of a class being entirely in the training set and absent from validation.&lt;/p&gt;

&lt;p&gt;Validation and test datasets use .take() to consume a fixed number of batches - computed from the split sizes - since the generator repeats indefinitely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;val_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;val_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_labels&lt;/span&gt;
          &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;test_ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;test_images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_strokes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_labels&lt;/span&gt;
          &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ceil&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_labels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Went Wrong Along the Way
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File count.&lt;/strong&gt; The processed dataset ends up with roughly 345 × 30,000 × 2 = 20.7 million files. Some filesystems handle this poorly. If you're on a filesystem with inode limits or slow directory listing (common with some HPC storage systems), the sorted(glob(...)) calls at index time can take several minutes. Structured subdirectories (one per class) help, but it's still a lot of files.&lt;/p&gt;
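&lt;p&gt;The arithmetic behind that number, for the record:&lt;/p&gt;

```python
# Back-of-the-envelope check of the dataset's file count:
# every sample is stored as one image file plus one stroke file.
classes = 345
samples_per_class = 30_000
files = classes * samples_per_class * 2
print(f"{files:,} files")  # 20,700,000 files, i.e. roughly 20.7 million
```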

&lt;p&gt;&lt;strong&gt;Index alignment.&lt;/strong&gt; The global index scheme — where file names reflect position across all classes, not within a class — exists entirely to prevent a specific bug. An earlier version used per-class indices, which caused a silent alignment failure: image cat/000001.npy and stroke cat/000001.npy were always aligned, but after shuffling, the code was pulling from globally-indexed lists and the class-local numbering didn't correspond. The {idx:06d} naming ensures that whatever index you retrieve from the lists, the image and stroke file names will match.&lt;/p&gt;
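&lt;p&gt;A toy version of the naming scheme makes the guarantee concrete (the helper and the "images"/"strokes" directory layout here are my own illustration, not the project’s actual code). Because both paths are derived from one global index, no shuffle can make them disagree:&lt;/p&gt;

```python
# Hypothetical helper illustrating the global {idx:06d} naming scheme.
# The directory layout is an assumption made for this sketch.
def paired_paths(class_name, global_idx):
    fname = f"{global_idx:06d}.npy"  # same zero-padded global index for both
    return f"images/{class_name}/{fname}", f"strokes/{class_name}/{fname}"

img_path, stroke_path = paired_paths("cat", 42)
print(img_path, stroke_path)  # images/cat/000042.npy strokes/cat/000042.npy
```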

&lt;p&gt;&lt;strong&gt;Training on a remote server with an unstable SSH connection.&lt;/strong&gt; The training history in the notebook has a gap. BackupAndRestore meant the model weights survived; the history object didn't. TensorBoard logs were the fallback, and the actual metrics are there - the notebook's loss and accuracy plots just show what was available from the Python history object after reconnecting. If you're doing long training runs remotely, save the history separately and frequently, not just at the end.&lt;/p&gt;
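&lt;p&gt;A minimal sketch of what “save the history frequently” could look like (framework-free here; in Keras the same logic would sit in a custom callback’s on_epoch_end, and the class and file names below are my own):&lt;/p&gt;

```python
import json

class HistorySaver:
    """Append metrics after every epoch and rewrite the JSON file each time,
    so a dropped SSH session costs at most one epoch of history."""
    def __init__(self, path):
        self.path = path
        self.history = {}

    def on_epoch_end(self, epoch, logs):
        for key, value in (logs or {}).items():
            self.history.setdefault(key, []).append(value)
        with open(self.path, "w") as f:  # overwrite with the full history so far
            json.dump(self.history, f)

saver = HistorySaver("history.json")
saver.on_epoch_end(0, {"loss": 1.2, "accuracy": 0.55})
saver.on_epoch_end(1, {"loss": 0.9, "accuracy": 0.63})
```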

&lt;p&gt;&lt;strong&gt;Memory growth with TensorFlow’s GPU allocator.&lt;/strong&gt; By default, TensorFlow pre-allocates the entire GPU memory. For a machine shared with other users, or one running other processes, this is a problem. The fix is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;gpu&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;gpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_memory_growth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gpu&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes TensorFlow allocate GPU memory incrementally as needed. It’s not set by default because it can slightly reduce performance in some scenarios, but for shared environments it’s basically always the right call.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I’d Do Differently
&lt;/h3&gt;

&lt;p&gt;The main thing I’d want to add is parallel file loading. Right now the generator is single-threaded — it loads one sample at a time, yields it, repeats. tf.data.AUTOTUNE on the prefetch helps by trying to keep the pipeline filled ahead of the model's consumption, but the actual I/O is sequential. Adding multiple generator workers (like PyTorch's num_workers) would reduce the time the GPU spends waiting for data.&lt;/p&gt;
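&lt;p&gt;A rough sketch of what multi-worker loading could look like with a plain thread pool (everything here is illustrative: load stands in for np.load, and in the real pipeline this would sit behind the generator):&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_batches(paths, load, batch_size=4, workers=8):
    """Load each batch's files concurrently so the I/O overlaps,
    a rough equivalent of PyTorch's num_workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i in range(0, len(paths), batch_size):
            # pool.map preserves input order, so sample alignment is kept
            yield list(pool.map(load, paths[i:i + batch_size]))

# Toy loader standing in for np.load; integers stand in for file paths.
batches = list(parallel_batches(list(range(10)), lambda p: p * 2))
print(batches)  # [[0, 2, 4, 6], [8, 10, 12, 14], [16, 18]]
```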

&lt;p&gt;LMDB would also be worth experimenting with. The advantage over millions of small files is that it’s a single file that supports fast key-value lookup, sequential reading, and doesn’t suffer from filesystem overhead per-entry. The disadvantage is that it complicates the setup and makes debugging harder. For this project the small-files approach was fast enough, but at larger scale it would start to matter.&lt;/p&gt;

&lt;p&gt;A smarter caching strategy (keeping recently accessed samples in a bounded RAM buffer) would also help with the “warm up” problem. The first epoch is always slower than subsequent ones because the OS page cache starts cold. A pre-warmed in-memory buffer for the most frequently accessed samples would smooth that out.&lt;/p&gt;
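&lt;p&gt;For a quick version of that bounded buffer, functools.lru_cache already does most of the work. A sketch with a stand-in loader (the tiny maxsize is only there to make eviction visible; a real buffer would hold thousands of samples):&lt;/p&gt;

```python
from functools import lru_cache

calls = []  # records every real "disk read" so eviction is observable

@lru_cache(maxsize=2)  # bounded RAM buffer: keep only the 2 most recent samples
def load_sample(path):
    calls.append(path)       # stands in for np.load(path)
    return f"data:{path}"

load_sample("a")
load_sample("b")
load_sample("a")  # cache hit: no disk read
load_sample("c")  # evicts the least-recently-used entry ("b")
load_sample("b")  # miss again: re-read from "disk"
print(calls)      # ['a', 'b', 'c', 'b']
```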

&lt;h3&gt;
  
  
  The Part That Surprised Me
&lt;/h3&gt;

&lt;p&gt;When I first sketched this out, my expectation was that disk-based loading would be noticeably slower than loading everything to RAM — enough to be a real bottleneck. It wasn’t, for a reason that only became clear after thinking about it: individual .npy file loads are fast. A (28, 28, 1) array at float32 is 3,136 bytes. A (130, 3) stroke array is 1,560 bytes. These are tiny files. The actual read time per sample is in the low microseconds, and the OS cache handles repeat accesses to recently-read files transparently.&lt;/p&gt;
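&lt;p&gt;Those per-sample sizes are just shape × itemsize, easy to confirm (the .npy file on disk adds a small fixed header on top of the raw array bytes):&lt;/p&gt;

```python
import numpy as np

# float32 is 4 bytes per element, so the quoted sizes follow directly.
img = np.zeros((28, 28, 1), dtype=np.float32)
stroke = np.zeros((130, 3), dtype=np.float32)
print(img.nbytes, stroke.nbytes)  # 3136 1560
```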

&lt;p&gt;What you trade away compared to pure in-memory loading is predictability. With everything in RAM, access time is constant. With disk loading, you’re occasionally hitting a file that isn’t cached, and that read takes longer. In practice, the prefetch buffer absorbs most of this variance. The GPU never actually sat idle waiting for data in my runs — the bottleneck was always computation, not I/O.&lt;/p&gt;

&lt;p&gt;The other thing that surprised me was how much the single-file-per-class approach had been hiding. When everything for cat is one big (100000, 784) array, you have no choice but to load the whole thing before you can access any of it. That's a loading cost you pay every time. With individual files, you pay per sample - and you only pay for the samples you actually use.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Notebook Setup (in case it’s useful)
&lt;/h3&gt;

&lt;p&gt;One thing worth mentioning for anyone running this on a remote server: the port forwarding setup for Jupyter. If you’re SSH-ing into a machine and want to run notebooks rather than pulling .py files and running them in screen sessions, you forward the Jupyter port to localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@server_ip

&lt;span class="c1"&gt;# On the server:&lt;/span&gt;
&lt;span class="k"&gt;jupyter&lt;/span&gt; notebook --no-browser --port=8888
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re going through two layers of SSH (e.g. a department gateway server that routes to a compute node), you just carry the forwarding through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ssh"&gt;&lt;code&gt;&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@gateway
&lt;span class="c1"&gt;# On gateway:&lt;/span&gt;
&lt;span class="k"&gt;ssh&lt;/span&gt; -L &lt;span class="m"&gt;8888&lt;/span&gt;:localhost:8888 user@compute_node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And for full control over Python and library versions, running the kernel inside a virtual environment is worth the setup time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3.12 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .tens
&lt;span class="nb"&gt;source&lt;/span&gt; .tens/bin/activate
pip &lt;span class="nb"&gt;install &lt;/span&gt;jupyter ipykernel tensorflow numpy tqdm matplotlib
python &lt;span class="nt"&gt;-m&lt;/span&gt; ipykernel &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--user&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.tens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can select .tens as the kernel in Jupyter and know exactly what Python version and library versions are running - which matters if you're planning to later quantize the model and deploy it somewhere like a Raspberry Pi, where the environment constraints are much stricter.&lt;/p&gt;

&lt;p&gt;The pipeline ended up being more engineered than I originally wanted. But it runs, it doesn’t crash, and it’ll scale to more classes or more samples per class without changes. For a dataset this size on a memory-constrained machine, that’s the bar.&lt;/p&gt;

&lt;p&gt;The code is all in the &lt;a href="https://github.com/YuvrajRaghuvanshiS/doodle-vision" rel="noopener noreferrer"&gt;repository&lt;/a&gt; if you want to look at the actual implementation rather than the edited excerpts here. I’ll make this public once the paper is accepted.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 14, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Reverse Engineering SmartLock by Parivahan: What I Found Inside a Python Proctoring App</title>
      <dc:creator>Yuvraj Raghuvanshi</dc:creator>
      <pubDate>Tue, 07 Apr 2026 12:14:33 +0000</pubDate>
      <link>https://dev.to/yuvrajraghuvanshis/reverse-engineering-smartlock-by-parivahan-what-i-found-inside-a-python-proctoring-app-3oda</link>
      <guid>https://dev.to/yuvrajraghuvanshis/reverse-engineering-smartlock-by-parivahan-what-i-found-inside-a-python-proctoring-app-3oda</guid>
      <description>&lt;p&gt;I didn’t plan to reverse engineer a proctoring application. I just wanted to understand why a page kept refreshing in an infinite loop.&lt;/p&gt;

&lt;p&gt;That one puzzling symptom ended up pulling me down a rabbit hole that took days to climb out of — involving PyInstaller internals, broken decompilers, browser automation quirks, and a race condition that convincingly pretended to be tamper detection. The journey was longer than I expected, and honestly more interesting. So I figured I might as well write it up.&lt;/p&gt;

&lt;p&gt;The application in question is &lt;strong&gt;SmartLock by Parivahan&lt;/strong&gt;, a proctoring system used for government driver’s learning license exams in India, and yes, I am getting a driving license at the age of 25. We all start somewhere.&lt;/p&gt;

&lt;p&gt;It’s a Python desktop app that locks down your machine, watches your screen, monitors USB ports, controls your browser, and talks to a remote server — all at the same time. Understanding how it does all of this, and how the pieces fit together, is what this article is about.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Opening the Box
&lt;/h3&gt;

&lt;p&gt;The first thing I did was look at the installation directory. This usually tells you a lot before you write a single line of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;_internal/
browser/
config/
log/
Pictures/
Smartlock.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That _internal/ folder was the giveaway. It's a classic PyInstaller signature. The folder contains Python libraries and compiled bytecode - essentially, a self-contained Python runtime bundled into a single executable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdfaenv29sva28s1ckkk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdfaenv29sva28s1ckkk.png" alt="Screenshot: Installation directory structure showing _internal/ folder and Smartlock.exe" width="373" height="266"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Installation directory structure showing _internal/ folder and Smartlock.exe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So the application was built in Python and packaged using PyInstaller. That meant extraction was possible using &lt;a href="https://github.com/extremecoders-re/pyinstxtractor" rel="noopener noreferrer"&gt;pyinstxtractor&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python pyinstxtractor.py Smartlock.exe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After extraction, the structure looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Smartlock.exe_extracted/
├── PYZ-00.pyz_extracted/
│ ├── asyncio/
│ ├── psutil/
│ ├── pydivert/
│ ├── selenium/
│ ├── websockets/
│ ├── win32com/
│ ├── yaml/
│ ├── controller.pyc
│ ├── registry_edit.pyc
│ └── ...
├── core.pyc
└── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of what you see here is noise — third-party libraries. The signal is in the handful of .pyc files: core.pyc, controller.pyc, registry_edit.pyc. These contain the actual application logic. Everything else is plumbing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2: Decompilation and Why It’s Always Messier Than It Sounds
&lt;/h3&gt;

&lt;p&gt;This is where things got annoying.&lt;/p&gt;

&lt;p&gt;I tried the standard tools first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uncompyle6 core.pyc
decompyle3 core.pyc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both failed. Version mismatch — the bytecode was compiled with a Python version these tools didn’t fully support. I eventually got somewhere using &lt;a href="https://pychaos.io/" rel="noopener noreferrer"&gt;pychaos&lt;/a&gt;, but I want to be honest about what “decompiled code” actually looks like in practice. It’s not clean. Comments are gone (they’re never stored in bytecode). Control flow gets reconstructed heuristically and is often wrong. You get artifacts like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;__CHAOS_PY_TEST_NOT_INIT_ERR__&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the actual code probably looked something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decompiler is doing its best, but it’s guessing. Reverse engineering at this level is less about reading code and more about reconstructing intent from imperfect evidence. You develop a feel for what the code is &lt;em&gt;trying&lt;/em&gt; to do, even when the syntax is broken.&lt;/p&gt;
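&lt;p&gt;The version mismatch that broke uncompyle6 and decompyle3 is diagnosable up front: the first four bytes of a .pyc file are a magic number identifying the interpreter that compiled it. A minimal sketch (function names are mine):&lt;/p&gt;

```python
import importlib.util

def read_magic(pyc_path):
    """Return the 4-byte magic number from a .pyc file header."""
    with open(pyc_path, "rb") as f:
        return f.read(4)

def compiled_by_current_python(pyc_path):
    """True if the .pyc was produced by the interpreter running this check."""
    return read_magic(pyc_path) == importlib.util.MAGIC_NUMBER

def magic_int(pyc_path):
    """Decode the magic number as the integer CPython uses internally."""
    return int.from_bytes(read_magic(pyc_path)[:2], "little")
```

&lt;p&gt;CPython lists the mapping from magic integers to versions in the comments of importlib/_bootstrap_external.py, so decoding the integer tells you which decompiler generation to reach for before you waste time on the wrong one.&lt;/p&gt;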

&lt;p&gt;The three key files broke down roughly like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;core.pyc&lt;/strong&gt; - main orchestrator: startup and thread management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;controller.pyc&lt;/strong&gt; - enforcement logic: monitoring and detection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;registry_edit.pyc&lt;/strong&gt; - OS-level restrictions: registry modifications&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Reconstructing the Architecture
&lt;/h3&gt;

&lt;p&gt;Once I had a working (if imperfect) picture of the code, the overall architecture became clear:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds92c40ok5jb8swdii9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsds92c40ok5jb8swdii9.png" width="800" height="978"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: detailed diagram showcasing the connections between SmartLock application, bundled Chrome application, and remote server&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s a tightly coupled system. The desktop app and the browser aren’t independent — they’re in constant communication. And both of them are talking to the remote server. Remove any one of these connections and the whole thing breaks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 4: The Browser That Kept Redirecting
&lt;/h3&gt;

&lt;p&gt;After reconstructing and running the application, I ran into something strange.&lt;/p&gt;

&lt;p&gt;The exam page kept redirecting to a 403.jsp page and then, a split second later, back to the exam login page. Every few seconds: reload, reload, reload. My first instinct was that this was intentional tamper detection. After all, the whole point of a proctoring system is to detect when something isn’t right. Maybe it had detected something about my environment and was punishing me with an infinite loop.&lt;/p&gt;

&lt;p&gt;That turned out to be wrong. But figuring out &lt;em&gt;why&lt;/em&gt; it was wrong took a while.&lt;/p&gt;

&lt;p&gt;The browser bundled with SmartLock isn’t a standard Chrome installation. It’s a portable Chromium build with a preconfigured user profile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;browser/
├── App/
│ └── Chrome-bin/
├── Data/
│ └── profile/
│ └── Default/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I inspected the stored cookies, session tokens, cached scripts, and extensions looking for some kind of tamper detection artifact. Nothing useful. The refreshing wasn’t coming from stored state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oxw3343h3j7ky19tqjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7oxw3343h3j7ky19tqjn.png" width="378" height="389"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: browser/ directory structure&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 5: SmartSocket.js — Where It All Connected
&lt;/h3&gt;

&lt;p&gt;The actual cause was in the exam webpage itself. Buried in the page source was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"SmartLock/SmartSocket.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script establishes a WebSocket connection to the local application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;socket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ws://localhost:8000/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the critical link. The browser doesn’t just display the exam — it actively depends on the local application being alive and reachable. As soon as the connection is established, the browser authenticates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Authentication&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1234&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;appl_no&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;reqOb&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And if the connection fails — even momentarily — the page clears the session and reloads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On connection failure:&lt;/span&gt;
&lt;span class="c1"&gt;// → Clear session&lt;/span&gt;
&lt;span class="c1"&gt;// → Redirect or reload&lt;/span&gt;
&lt;span class="c1"&gt;// → UI resets to initial state&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not tamper detection. It’s a strict runtime dependency. The browser requires the local WebSocket server to be up &lt;em&gt;before&lt;/em&gt; it finishes loading. If it isn’t, you get an infinite reload loop.&lt;/p&gt;
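&lt;p&gt;For context, the local app’s side of this exchange probably looks something like the following reconstruction. The field names ("type", "token", "userid") match what SmartSocket.js sends; the handler logic is my guess, not decompiled code:&lt;/p&gt;

```python
import json

# Hypothetical reconstruction of the local app's message dispatch.
# The hardcoded token mirrors the '1234' seen in SmartSocket.js.
EXPECTED_TOKEN = "1234"

def handle_message(raw):
    """Dispatch one WebSocket frame from the browser; return a reply dict."""
    msg = json.loads(raw)
    if msg.get("type") == "Authentication":
        ok = msg.get("token") == EXPECTED_TOKEN and "userid" in msg
        return {"type": "Authentication", "status": "ok" if ok else "denied"}
    return {"type": "Error", "status": "unknown message type"}
```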

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29jg4lht5kvh2be0sfge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29jg4lht5kvh2be0sfge.png" alt="Screenshot: SmartSocket.js connection code in browser devtools" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: SmartSocket.js connection code in browser devtools&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 6: The Race Condition
&lt;/h3&gt;

&lt;p&gt;With that understanding, the root cause became obvious. The application starts the WebSocket server and the browser in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;thread1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;start_websocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;thread2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;launch_browser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with this is that “starting” the WebSocket server takes a moment. The browser, however, is fast — it loads the page, runs the script, and tries to connect to localhost:8000 before the server is actually ready. Connection fails. Page reloads. Tries again. Same thing. Infinite loop.&lt;/p&gt;

&lt;p&gt;The sequence of events looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser loads page
 → SmartSocket.js executes immediately
 → Attempts WebSocket connection to localhost:8000
 → Server not ready yet
 → Connection refused
 → Page session cleared
 → Page reloads
 → Same thing happens again
 → ...forever
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It perfectly mimicked tamper detection behavior, which is why I assumed that’s what it was. But it was just a timing issue.&lt;/p&gt;

&lt;p&gt;The fix is simple: wait for the server to be ready before launching the browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;browser_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_connection&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;OSError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Now launch browser
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
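&lt;p&gt;The same wait loop, pulled out as a standalone helper (names are mine), can be exercised against any throwaway TCP server:&lt;/p&gt;

```python
import socket
import time

def wait_for_port(host, port, attempts=50, delay=0.1):
    """Poll until a TCP server is accepting connections; True on success."""
    for _ in range(attempts):
        try:
            socket.create_connection((host, port), timeout=delay).close()
            return True
        except OSError:
            # Server not up yet; back off briefly and retry.
            time.sleep(delay)
    return False
```

&lt;p&gt;Bounding the retries matters: if the server genuinely failed to start, you want a clean failure after a few seconds, not a silent hang.&lt;/p&gt;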



&lt;p&gt;With this in place, the correct sequence is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start App
 → Start WebSocket server
 → Poll until server is accepting connections
 → Launch browser
 → WebSocket connects successfully
 → Authentication succeeds
 → Exam proceeds normally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj14za7htls7g21ap02a2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj14za7htls7g21ap02a2.gif" width="600" height="338"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Before — infinite reload loop vs successful startup&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 7: What the App Is Actually Doing Under the Hood
&lt;/h3&gt;

&lt;p&gt;Once the startup problem was solved, I could look more carefully at all the enforcement mechanisms running in the background. There’s quite a lot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OS-Level Lockdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The registry editor modifies Windows to disable the usual escape routes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;DisableTaskMgr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;DisableLockWorkstation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;NoLogoff&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This disables Task Manager, the lock screen, and the ability to log off. The Ctrl+Alt+Del menu effectively becomes useless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The monitoring engine maintains a list of software that shouldn’t be running during an exam. Screen recording and virtual camera tools are specifically targeted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OBS Studio&lt;/li&gt;
&lt;li&gt;ManyCam&lt;/li&gt;
&lt;li&gt;XSplit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of these processes are detected, a violation is flagged.&lt;/p&gt;
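&lt;p&gt;Stripped of the Windows-specific process enumeration, the check reduces to a set intersection. A sketch with the process list injected (the executable names here are illustrative, not the app’s actual list):&lt;/p&gt;

```python
# Illustrative blacklist; the real app ships its own list of process names.
BLACKLIST = {"obs64.exe", "manycam.exe", "xsplit.core.exe"}

def detect_violations(running_processes, blacklist=BLACKLIST):
    """Return blacklisted process names found in the running set."""
    running = {name.lower() for name in running_processes}
    return sorted(running & {b.lower() for b in blacklist})
```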

&lt;p&gt;&lt;strong&gt;USB Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The app takes a snapshot of connected USB devices at startup and watches for changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_usb&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;initial_usb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;flag_violation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plugging in a USB drive during the exam is treated as a potential integrity violation.&lt;/p&gt;
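&lt;p&gt;The snapshot-and-compare logic is simple to sketch in isolation (names are mine):&lt;/p&gt;

```python
def usb_changes(initial, current):
    """Compare a startup USB snapshot against the current device set."""
    initial, current = set(initial), set(current)
    return {
        "added": sorted(current - initial),
        "removed": sorted(initial - current),
    }

def usb_violation(initial, current):
    """Any change to the device set counts as a potential violation."""
    diff = usb_changes(initial, current)
    return bool(diff["added"] or diff["removed"])
```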

&lt;p&gt;&lt;strong&gt;Multi-Monitor and VM Detection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Multiple displays are blocked. The app also checks whether it’s running inside a virtual machine — which would make it easier to manipulate the environment without being detected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Filtering via pydivert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the most interesting piece. The app uses pydivert - a Python wrapper around WinDivert - to implement packet-level network filtering. During an exam, only certain destinations are allowed. Everything else is dropped at the kernel level.&lt;/p&gt;
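&lt;p&gt;pydivert itself is Windows-only, but the allow/drop decision it enforces can be sketched as pure logic. The addresses below appear in the app’s config file; treating them as the allowlist is my assumption:&lt;/p&gt;

```python
import ipaddress

# Assumed allowlist, built from addresses seen in the YAML config.
ALLOWED = {
    ipaddress.ip_address("164.100.69.5"),
    ipaddress.ip_address("10.172.31.33"),
}

def should_pass(dst_ip, allowed=ALLOWED):
    """Allowlist decision: only packets to permitted servers get through."""
    return ipaddress.ip_address(dst_ip) in allowed
```

&lt;p&gt;In the real app this decision runs per packet inside the WinDivert capture loop, which is what makes the filtering effectively kernel-level from the user’s point of view.&lt;/p&gt;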

&lt;h3&gt;
  
  
  Phase 8: Two WebSockets, Not One
&lt;/h3&gt;

&lt;p&gt;I initially assumed there was a single WebSocket connection: browser to local app. There are actually two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Local WebSocket&lt;/strong&gt; (localhost:8000): Browser ↔ Desktop App&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Remote WebSocket&lt;/strong&gt;: Desktop App ↔ Remote Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The local one handles session management for the browser. The remote one is for continuous telemetry — the app regularly sends status updates to the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"USB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ProcessCheck"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"detected"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server isn’t passive. It’s continuously validating that the client is behaving correctly. If the telemetry stops or reports a violation, the server can terminate the session.&lt;/p&gt;
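&lt;p&gt;A frame builder in the shape of those messages might look like this (the helper is mine; only the field names come from the observed traffic):&lt;/p&gt;

```python
import json

def telemetry(check_type, **fields):
    """Build one status frame in the shape seen on the remote socket."""
    frame = {"type": check_type}
    frame.update(fields)
    return json.dumps(frame)
```

&lt;p&gt;The app presumably emits these on a timer, which is why silence alone is enough for the server to kill the session: a missing heartbeat is indistinguishable from a killed client.&lt;/p&gt;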

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All the connection details live in a YAML config file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;ExamIp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;164.100.69.5&lt;/span&gt;
&lt;span class="py"&gt;ExamUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathi.parivahan.gov.in/sarathiservice/authenticationaction.do?authtype=Anugyna&lt;/span&gt;
&lt;span class="py"&gt;SocketPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;8000&lt;/span&gt;
&lt;span class="py"&gt;SocketServerPort&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3000&lt;/span&gt;
&lt;span class="py"&gt;SocketUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ws://sarathi.parivahan.gov.in&lt;/span&gt;
&lt;span class="py"&gt;StatusApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathi.parivahan.gov.in/sarathiWS/rsServices/smartLockCheck/smartLockCheck&lt;/span&gt;
&lt;span class="py"&gt;ViolationApiUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://sarathicov.nic.in:8443/sarathiWS/rsServices/smartLockCheck/examViolation&lt;/span&gt;
&lt;span class="py"&gt;primaryServerIPV4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.172.31.33&lt;/span&gt;
&lt;span class="py"&gt;primaryServerIPV6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2001:4408:7204:8:5d93:8239:8876:d238&lt;/span&gt;
&lt;span class="py"&gt;secondaryServerIPV4&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.172.31.30&lt;/span&gt;
&lt;span class="py"&gt;secondaryServerIPV6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2001:4408:7204:9::aac:2033&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the behavior is somewhat server-controlled. The exam URL, the socket address, the session logic — it’s all configured externally, which makes the server the real authority over how the session runs.&lt;/p&gt;
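&lt;p&gt;Since the file is flat key/value YAML, even a dependency-free parser can read it. In practice you’d use PyYAML; this sketch assumes no nesting:&lt;/p&gt;

```python
def parse_flat_yaml(text):
    """Parse a flat 'key: value' config like SmartLock's (no nesting)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the FIRST colon only, so URL values survive intact.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config
```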

&lt;h3&gt;
  
  
  Phase 9: The Firewall
&lt;/h3&gt;

&lt;p&gt;One more thing worth mentioning: the app appears to interact with the Windows Firewall directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pbox_fw_backup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pbox_bkp.wfw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This exports the current firewall rules to a backup file before modifying them. The pattern suggests the app replaces your firewall rules with its own restricted ruleset for the duration of the exam, then restores the backup afterward. It may also hash the backup to detect if someone has tampered with the rules mid-session.&lt;/p&gt;
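&lt;p&gt;If the app does hash the backup, one plausible shape for that check (function names are mine, not from the decompiled code):&lt;/p&gt;

```python
import hashlib

def fingerprint(path):
    """SHA-256 of the exported firewall backup, recorded at session start."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def tampered(path, recorded_hash):
    """Re-hash mid-session; any difference means the rules file changed."""
    return fingerprint(path) != recorded_hash
```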

&lt;h3&gt;
  
  
  Phase 10: Why You Can’t Just Rebuild It
&lt;/h3&gt;

&lt;p&gt;One last thing I want to address directly: extracting the PyInstaller binary does not give you a working copy of the application. There’s a common misconception that extraction equals reconstruction. It doesn’t.&lt;/p&gt;

&lt;p&gt;The workflow for actually rebuilding would be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decompile .pyc files to .py&lt;/li&gt;
&lt;li&gt;Manually correct the decompilation errors&lt;/li&gt;
&lt;li&gt;Rebuild with PyInstaller&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1 and 2 are where it falls apart in practice. The decompiled code has inaccuracies that aren’t always obvious. Some of the control flow is wrong in subtle ways that only become apparent at runtime. There are also timing dependencies baked into the threading model, and the server-side validation means you’d need a cooperating server to test anything properly.&lt;/p&gt;

&lt;p&gt;The security here doesn’t come from any single mechanism being unbreakable. It comes from the combination:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The local app monitors the OS&lt;/li&gt;
&lt;li&gt;The browser depends on the local app&lt;/li&gt;
&lt;li&gt;The server monitors the local app&lt;/li&gt;
&lt;li&gt;The network is filtered at the kernel level&lt;/li&gt;
&lt;li&gt;The firewall is replaced during the session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer on its own is probably defeatable. Together, they create a system where defeating one layer doesn’t help much because the others remain intact.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Took Away From This
&lt;/h3&gt;

&lt;p&gt;A few things stuck with me after this investigation.&lt;/p&gt;

&lt;p&gt;The infinite reload loop was genuinely convincing as tamper detection. I spent more time than I’d like to admit looking for a security mechanism that wasn’t there. The lesson is that emergent behavior from a race condition can look exactly like intentional defensive behavior. Don’t assume intent before you’ve traced the actual execution path.&lt;/p&gt;

&lt;p&gt;The browser is doing real security work here, not just displaying a UI. SmartSocket.js is the link between the exam session and the local enforcement system. If that connection breaks, the exam can't proceed. That's a deliberate architectural choice, not an accident.&lt;/p&gt;

&lt;p&gt;And PyInstaller extraction, while possible, is just the beginning. The hard part isn’t getting the bytecode out. It’s making sense of what the decompiler gives you and reconstructing what the developer originally meant to write.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Could Come Next
&lt;/h3&gt;

&lt;p&gt;If I were going to continue this investigation, the natural directions would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mapping the full WebSocket protocol between browser and app, and between app and server&lt;/li&gt;
&lt;li&gt;Tracing the telemetry payloads to understand exactly what data gets sent and when&lt;/li&gt;
&lt;li&gt;Building a sequence diagram for the full session lifecycle, from startup through exam completion&lt;/li&gt;
&lt;li&gt;Looking more carefully at the firewall manipulation and how (or whether) it detects tampering with the backed-up rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that fits in one article. But the architecture is now clear enough that any of those threads could be pulled independently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz0dbz8bvw1ackgxjbh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz0dbz8bvw1ackgxjbh3.png" alt="Screenshot: Final — application running normally with exam loaded, WebSocket connection established" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot: Final — application running normally with exam loaded, WebSocket connection established&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is for educational purposes — understanding how production security systems are architected and why they’re difficult to tamper with. The focus throughout has been on the design and behavior of the system, not on defeating it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article was rewritten using AI chatbots.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;April 07, 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>websocket</category>
      <category>python</category>
      <category>reverseengineering</category>
    </item>
  </channel>
</rss>
