DEV Community: Yuvraj Raghuvanshi

Counting a Billion Things With 1.5 Kilobytes

Yuvraj Raghuvanshi — Fri, 15 May 2026 09:35:52 +0000

Here’s a problem that sounds trivial until you think about it for a moment.

You have a stream of data, user IDs hitting your API, search queries, IP addresses, product views. The stream is enormous: hundreds of millions of events per day. Someone asks: how many unique users did we serve today?

The naive answer is a set. Throw every user ID into a set, count its size at the end of the day. This is correct. It is also catastrophically expensive. A set of 100 million 64-bit integers takes roughly 800 megabytes of memory. Scale to a billion users and you’re at 8 gigabytes, just for one counter, just for one day. Redis running PFCOUNT on a key with a billion unique members would consume none of that. It uses 1.5 kilobytes.

That number is not a typo. The algorithm behind it is called HyperLogLog, it was invented by a French mathematician named Philippe Flajolet in 2007, and it is one of the most satisfying things I’ve encountered in a while. Let’s build it from scratch.

The Problem Has a Name

This is called the count-distinct problem , or cardinality estimation. You have a multiset (a collection where elements can repeat) and you want to know how many distinct elements it contains, without storing all of them.

The exact answer always requires memory proportional to the number of distinct elements. There’s no way around that. But if you’re willing to accept a small, predictable error (say, within 2% of the true count) you can do something remarkable: use memory proportional to the logarithm of the logarithm of the count. That’s where the “LogLog” in HyperLogLog comes from. It’s not a marketing name. It’s a description of the memory complexity.

To understand how this is possible, we need to start with a coin flip.

The Coin Flip Observation

Imagine flipping a fair coin repeatedly until you get heads, and recording how many tails you saw before the first heads. Call that number k.

If k = 0, you got heads immediately. Probability: 1/2. If k = 1, you got one tail then heads. Probability: 1/4. If k = 2, two tails then heads. Probability: 1/8.

The probability of seeing k leading tails before the first heads is 1 / 2^(k+1). Which means: if the longest run of leading tails you've ever seen across many experiments is k, you've probably run roughly 2^k experiments.

See where this is going?

If we hash every element in our stream to a sequence of bits, those bits behave like coin flips, roughly half start with 0, a quarter start with 00, an eighth start with 000. If the longest run of leading zeros we’ve seen in any hash is k, our estimate for how many distinct elements we've processed is 2^k.

Let’s check the intuition with real numbers. Say we have 1,000 distinct elements. The probability that at least one of them hashes to a value starting with 10 leading zeros is very high, because 2^10 = 1024 ≈ 1000. The probability that any of them hashes to 20 leading zeros is astronomically low (2^20 = 1,048,576) much larger than our actual count.

So the maximum run of leading zeros is a surprisingly good estimator of the order of magnitude of distinct elements seen. The key insight is that we only need to store one number (the current maximum) regardless of how many elements we’ve seen.

Here’s the simplest possible version:

# util.py
import hashlib

def count_leading_zeros(hash_bits: int, bit_length: int = 32) -> int:
    """Count leading zero bits in a hash value."""
    if hash_bits == 0:
        return bit_length
    count = 0
    for i in range(bit_length - 1, -1, -1):
        if (hash_bits >> i) & 1:
            break
        count += 1
    return count

def hash_element(element: str) -> int:
    """Hash an element to a 32-bit integer."""
    h = hashlib.md5(element.encode()).hexdigest()
    return int(h[:8], 16) # Take first 32 bits

# naive.py
from util import count_leading_zeros, hash_element

class NaiveEstimator:
    def __init__ (self):
        self.max_leading_zeros = 0

    def add(self, element: str):
        h = hash_element(element)
        k = count_leading_zeros(h)
        self.max_leading_zeros = max(self.max_leading_zeros, k)

    def estimate(self) -> int:
        return 2**self.max_leading_zeros

Let’s test it:

# test.py
import random
import string

from naive import NaiveEstimator

def random_id():
    return "".join(random.choices(string.ascii_lowercase, k=10))

naive_estimator = NaiveEstimator()
true_set = set()

for _ in range(100_000):
    elem = random_id()
    naive_estimator.add(elem)
    true_set.add(elem)

print(f"True count: {len(true_set)}\n")

est = naive_estimator.estimate()
error = abs(est - len(true_set)) / len(true_set) * 100

print(f"[naive] \tEstimated count: {est}")
print(f"[naive] \tError: {error:.2f}%\n")

Running this a few times gives results like:

True count: 100000
[naive] Estimated count: 32768 ← 2^15
[naive] Error: 67.23%

True count: 100000
[naive] Estimated count: 65536 ← 2^16
[naive] Error: 34.46%

It’s correct to within an order of magnitude, consistently. But notice the problem: the estimate can only ever be a power of 2. It jumps from 32,768 to 65,536 with no values in between. The estimate has extremely high variance, a single unlucky hash producing extra leading zeros throws everything off. This is the Flajolet-Martin algorithm from 1984. It works, but it’s rough.

Reducing Variance: Buckets

The solution to high variance is the same as it always is in statistics: take more samples and average them.

One approach: run multiple independent hash functions and average the results. But hashing everything multiple times is expensive.

A smarter approach, from Flajolet and Durand’s 2003 LogLog paper: use a single hash, but split it into two parts. Use the first b bits to choose a bucket (one of m = 2^b buckets), and run the leading-zero counter on the remaining bits.

Each bucket independently estimates the cardinality from the subset of elements that landed in it. We then combine the buckets by averaging. We’ve effectively gotten m independent estimates from a single hash function.

# loglog.py
from util import count_leading_zeros, hash_element

class LogLogEstimator:
    def __init__ (self, b: int = 8):
        """
        b: number of bits used for bucket index
        m: number of buckets = 2^b
        """
        self.b = b
        self.m = 2**b
        self.buckets = [0] * self.m

    def add(self, element: str):
        h = hash_element(element)

        # First b bits → bucket index
        bucket_index = h >> (32 - self.b)

        # Remaining 32-b bits → count leading zeros
        remaining = h & ((1 << (32 - self.b)) - 1)
        leading_zeros = count_leading_zeros(remaining, 32 - self.b)

        # Keep the maximum for this bucket
        self.buckets[bucket_index] = max(
            self.buckets[bucket_index],
            leading_zeros,
        )

    def estimate(self) -> float:
        # LogLog: geometric mean across buckets
        avg = sum(self.buckets) / self.m
        return self.m * (2**avg)

With b = 8, we have 256 buckets, each storing one small integer (max ~32). Total memory: 256 bytes. And the estimates are dramatically more stable than the naive version. The standard error of LogLog is 1.3 / sqrt(m), with 256 buckets, that's about 8%.

But we can do better. Still in the same 2003 paper, Flajolet noticed that the largest bucket values are outliers that inflate the estimate. He suggested keeping only the bottom 70% of bucket values for the average. This is SuperLogLog , and it reduces the error to 1.05 / sqrt(m), about 6.5% with 256 buckets, with no memory increase.

Then came 2007.

The Harmonic Mean: The HyperLogLog Insight

The jump from SuperLogLog to HyperLogLog is a single change: replace the geometric mean with the harmonic mean.

The harmonic mean of a set of values is:

n / (1/v1 + 1/v2 + ... + 1/vn)

Why does this help? The harmonic mean is less sensitive to large outliers than the geometric mean. When one bucket has seen an unusually large leading-zero count, because one element happened to hash to something that starts with twelve zeros, that bucket’s contribution to the harmonic mean is 1/2^12, which is tiny. It barely moves the needle. The geometric mean (averaging the exponents) would give it far more weight.

The HyperLogLog estimate is:

estimate = correction_factor * m^2 * 1 / sum(2^(-bucket[i]) for all i)

Where correction_factor is a constant that depends on m (approximately 0.7213 for large m), and the whole formula is just the harmonic mean of 2^bucket[i] values, scaled.

# hyperloglog.py
import math

from util import count_leading_zeros, hash_element

# Correction factors per the original paper
CORRECTION_FACTORS = {
    16: 0.673,
    32: 0.697,
    64: 0.709,
}

class HyperLogLog:
    def __init__ (self, b: int = 8):
        """
        b: number of index bits (between 4 and 16)
        m = 2^b buckets
        Standard error ≈ 1.04 / sqrt(m)
        """
        assert 4 <= b <= 16, "b must be between 4 and 16"
        self.b = b
        self.m = 2**b
        self.buckets = [0] * self.m

        if self.m in CORRECTION_FACTORS:
            self.alpha = CORRECTION_FACTORS[self.m]
        else:
            # for m >= 128
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, element: str):
        h = hash_element(element)

        # First b bits → bucket index
        bucket_index = h >> (32 - self.b)

        # Remaining bits → position of leftmost 1-bit
        # (= 1 + number of leading zeros in remaining bits)
        remaining = h & ((1 << (32 - self.b)) - 1)
        rho = count_leading_zeros(remaining, 32 - self.b) + 1

        self.buckets[bucket_index] = max(self.buckets[bucket_index], rho)

    def estimate(self) -> float:
        # Harmonic mean of 2^bucket values
        harmonic_sum = sum(2.0 ** (-b) for b in self.buckets)
        raw_estimate = self.alpha * (self.m**2) / harmonic_sum

        # Small range correction: use LinearCounting when estimate is small
        if raw_estimate <= 2.5 * self.m:
            zeros = self.buckets.count(0)
            if zeros > 0:
                return self.m * math.log(self.m / zeros)

        # Large range correction (for 32-bit hashes)
        if raw_estimate > (1.0 / 30.0) * (2**32):
            return -(2**32) * math.log(1 - raw_estimate / (2**32))

        return raw_estimate

Two corrections are applied at the edges:

Small range correction : When the estimate is less than 2.5x the number of buckets, many buckets are still empty (zero). In that regime, a separate algorithm called LinearCounting is more accurate. LinearCounting uses the number of empty buckets: m * ln(m / empty_buckets). HyperLogLog switches to it automatically.

Large range correction : When the estimate approaches the maximum value representable by a 32-bit hash (about 4 billion), hash collisions start causing systematic undercount. A logarithmic correction compensates.

Now let’s test this properly:

# test.py
import random
import string

from hyperloglog import HyperLogLog

def random_id():
    return "".join(random.choices(string.ascii_lowercase, k=10))

hyper_log_log = HyperLogLog(b=10) # 1024 buckets, ~1KB
true_set = set()

for _ in range(100_000):
    elem = random_id()
    hyper_log_log.add(elem)
    true_set.add(elem)

print(f"True count: {len(true_set)}\n")

est = hyper_log_log.estimate()
error = abs(est - len(true_set)) / len(true_set) * 100

print(f"[hyper log log] \tEstimated count: {est:,.0f}")
print(f"[hyper log log] \tError: {error:.2f}%")
print(f"[hyper log log] \tMemory (approx): {hyper_log_log.m} bytes for buckets")

Output:

True count: 100000

[hyper log log] Estimated count: 97,291
[hyper log log] Error: 2.71%
[hyper log log] Memory (approx): 1024 bytes for buckets

1024 bytes. About 2% error in this particular run. On 100,000 unique elements.

With b = 10 (1024 buckets), the theoretical standard error is 1.04 / sqrt(1024) = 3.25%. In practice it often lands well within that. With b = 12 (4096 buckets, still only 4KB), the standard error drops to 1.04 / sqrt(4096) = 1.6%.

The Memory Arithmetic

Let’s be precise about why this is so small.

Each bucket stores the maximum run of leading zeros it’s seen. With a 32-bit hash and b bits used for the index, the remaining 32 - b bits are used for the zero count. The maximum possible count is 32 - b, which for b = 10 is 22. You need 5 bits to represent numbers up to 22.

So each bucket needs 5 bits. With 1024 buckets: 1024 * 5 = 5120 bits = 640 bytes.

In practice Redis uses 6 bits per register and 16,384 registers (b = 14): 16384 * 6 = 98,304 bits = 12,288 bytes = 12KB. With this configuration the standard error is 1.04 / sqrt(16384) = 0.81%. Under 1% error, for a 12KB data structure, counting up to billions of distinct elements.

Redis also uses a dense/sparse encoding: when few elements have been added, only the non-zero buckets are stored. The 12KB limit is only reached with a large number of distinct elements.

The Property That Makes It Production-Useful

There’s one more thing that makes HyperLogLog more than just a clever approximation: HyperLogLog sketches are mergeable.

If you have a HyperLogLog for Monday’s traffic and one for Tuesday’s traffic, you can merge them into a HyperLogLog for Monday+Tuesday’s combined unique users by taking the element-wise maximum of the two bucket arrays:

def merge(hll1: HyperLogLog, hll2: HyperLogLog) -> HyperLogLog:
    assert hll1.b == hll2.b, "Can only merge HLLs with same b"
    merged = HyperLogLog(hll1.b)
    merged.buckets = [
        max(b1, b2)
        for b1, b2 in zip(hll1.buckets, hll2.buckets)
    ]
    return merged

This takes O(m) time and produces an estimate with the same error guarantee as a fresh HyperLogLog that had seen both datasets. No re-processing. No storing the original data.

This is why it appears in distributed systems everywhere. You can compute HyperLogLog sketches independently on shards, machines, or time windows, then merge them instantly. Reddit uses this for per-post unique view counts distributed across servers. BigQuery uses it for APPROX_COUNT_DISTINCT(). Facebook Presto, Apache Druid, Amazon Redshift - they all have it.

In Redis, PFADD adds elements and PFMERGE merges two HLLs. The PF prefix is a tribute to Philippe Flajolet, who died in 2011, four years after publishing the paper that named the algorithm.

What I Actually Took From This

There’s a category of algorithms where the core idea is so simple that it seems like it can’t possibly work, and the whole experience of learning it is going from skepticism to surprise to understanding. Counting distinct elements by tracking coin-flip statistics is in that category.

The thing I keep thinking about is the memory arithmetic. A set() in Python holding 100,000 integers uses roughly 4MB. The HyperLogLog above used 1KB and got within 2%. The set uses 4,000x more memory for exact precision. Whether that precision is worth 4,000x the memory depends entirely on what you're building.

For a unique user counter that needs to answer in real time, served from Redis, on billions of events, it’s not worth it. For a billing system that needs to charge per unique user to the cent, it might be. Knowing which situation you’re in is the actual engineering skill. HyperLogLog is just a tool. A very elegant one.

This article is rewritten using AI chatbots.

I Had to Run a Blockchain on My Laptop, So I Put It in Kubernetes

Yuvraj Raghuvanshi — Wed, 06 May 2026 06:38:06 +0000

The assignment was to build a ticket booking system using Hyperledger Fabric. Two entities (travel agencies and customers) a shared ledger, and a requirement that every booking be verifiable on the blockchain. We had weeks to do it.

I did not plan to spend the first few weeks fighting infrastructure before writing a single line of business logic.

What Hyperledger Fabric Actually Is

Before I get into what went wrong, it’s worth explaining what Hyperledger Fabric is, because it’s quite different from what most people imagine when they hear “blockchain.”

When people think of blockchain, they usually think of Bitcoin or Ethereum: a public network anyone can join, where transactions are anonymous, and where consensus is reached through computational work. Hyperledger Fabric is none of those things. It’s a permissioned blockchain framework — every participant must be explicitly identified and credentialed before they can interact with the network. There are no anonymous transactions. There is no mining.

Fabric’s target audience is enterprises and consortiums. Think of a group of banks that want to record inter-bank settlements on a shared ledger, or airlines and travel agencies that want a single source of truth for ticket inventory. Each of those organizations runs their own nodes, retains control of their own data, and collectively agrees on what gets written to the ledger. No one organization controls the chain.

The core building blocks of a Fabric network are:

Peers : The actual nodes that store a copy of the ledger and execute the smart contracts (called “chaincode” in Fabric). Each organization in the network runs one or more peers. If you have two organizations, you have at least two sets of peers.

Orderers : A separate cluster of nodes whose only job is to sequence transactions and package them into blocks. Peers don’t talk to each other to agree on order; they send transactions to the orderers, who handle that. The orderers use a consensus algorithm called Raft — the same one used in databases like etcd — where one node is elected leader and the others follow.

Certificate Authorities (CAs): Since every participant must be credentialed, Fabric runs a CA for each organization. These issue the cryptographic identities (X.509 certificates) that peers, orderers, and users present when making any request. No valid certificate, no access.

Channels : A Fabric network can have multiple independent sub-ledgers called channels. Each channel has its own blockchain, its own set of members, and its own chaincode. In this project there’s one channel: mychannel.

Chaincode : The smart contracts. These are programs that run on the peers and define what operations can be performed on the ledger. In Fabric, chaincode is written in a real programming language (Go, Java, TypeScript). When a client wants to record a booking, it calls a chaincode function. The chaincode executes on the peer, validates the inputs, and writes to the ledger.

World State : The current state of all data, stored in a database (CouchDB in this project). When chaincode writes data, it goes into the world state. The blockchain itself records the history of every transaction; the world state is the up-to-date snapshot.

For this project, the network has three organizations. Org0 runs the ordering service — three orderer nodes using Raft consensus. Org1 represents travel agencies. Org2 represents customers. Each org has two peers and one CA.

Diagram: Three boxes labeled Org0 (3 orderers), Org1 (2 peers + CA), Org2 (2 peers + CA), connected by a channel labeled mychannel

The Problem With “Simple”

Fabric ships with several example networks. The simplest is a Docker Compose setup that brings up a few containers on your local machine. I started there.

It didn’t connect reliably. Peers couldn’t reach each other. The REST API sample couldn’t find the ledger. I tried the JavaScript version. Same issues. The Docker Compose approach works fine if you follow the tutorial exactly on a clean machine with the right Fabric binary versions. In practice, when you’re trying to connect your own code to it rather than running the provided samples, small mismatches in TLS configuration or service discovery cause silent failures that are difficult to trace.

The Kubernetes-based test network (test-network-k8s in the fabric-samples repository) was the only variant that worked consistently. And once I committed to it, it solved a second problem I hadn't fully thought through: I needed to run a lot of things simultaneously. There was a customer backend, a travel agency backend, a unified frontend, the Fabric REST interface, and the Fabric network itself - eight-plus processes. Kubernetes gave me a way to run all of that in one KIND cluster (KIND is "Kubernetes IN Docker" - it runs a full Kubernetes cluster inside Docker containers on your laptop) without manually managing ports, docker networks, and process restarts.

So the choice wasn’t ideological. It was pragmatic: Kubernetes was what worked, and it handled the orchestration problem for free.

What the Kubernetes Deployment Actually Looks Like

Every component in this network is a Kubernetes resource. Let me show what that means concretely with the peer deployment.

Each peer has three Kubernetes resources: a Certificate (for TLS), a ConfigMap (for environment config), and a Deployment that runs the actual container. Here's the ConfigMap for org1-peer1, which shows how the peer is configured:

apiVersion: v1
kind: ConfigMap
metadata:
  name: org1-peer1-config
data:
  CORE_PEER_ID: org1-peer1.org1.example.com
  CORE_PEER_ADDRESS: org1-peer1:7051
  CORE_PEER_LOCALMSPID: Org1MSP
  CORE_PEER_MSPCONFIGPATH: /var/hyperledger/fabric/organizations/...
  CORE_PEER_GOSSIP_BOOTSTRAP: org1-peer2:7051
  CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG: '{"peername":"org1peer1"}'
  CORE_LEDGER_STATE_STATEDATABASE: CouchDB
  CORE_LEDGER_STATE_COUCHDBCONFIG_COUCHDBADDRESS: localhost:5984

CORE_PEER_GOSSIP_BOOTSTRAP tells this peer to connect to org1-peer2 when it starts, for gossip - the protocol peers use to share ledger state with each other. CHAINCODE_AS_A_SERVICE_BUILDER_CONFIG tells the peer the name of this specific peer so the chaincode deployment knows which sidecar belongs to which peer. The CouchDB configuration is because the world state for this project is stored in CouchDB rather than the default LevelDB, which gives richer query capability.

The Deployment spec itself is interesting because it runs two containers in the same pod:

containers:
  - name: main
    image: ${FABRIC_PEER_IMAGE}
    ports:
      - containerPort: 7051 # gRPC for clients
      - containerPort: 7052 # gRPC for chaincode
      - containerPort: 9443 # operations/health
  - name: couchdb
    image: couchdb:${COUCHDB_VERSION}
    env:
      - name: COUCHDB_USER
        value: admin
      - name: COUCHDB_PASSWORD
        value: adminpw
    ports:
      - containerPort: 5984

The peer and its CouchDB instance are co-located in the same pod. CouchDB is accessed at localhost:5984 from inside the peer container - they share a network namespace since they're in the same pod. This is standard Kubernetes sidecar pattern.

The TLS certificate for the peer is handled by cert-manager, a Kubernetes add-on that automates certificate issuance. Each peer gets a certificate with multiple DNS names:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: org1-peer1-tls-cert
spec:
  dnsNames:
    - localhost
    - org1-peer1
    - org1-peer1.test-network.svc.cluster.local
    - org1-peer1.localho.st
    - org1-peer-gateway-svc
  secretName: org1-peer1-tls-cert
  issuerRef:
    name: org1-tls-cert-issuer

The certificate needs to cover all the names by which clients might reach this peer — inside the cluster, outside the cluster via ingress, and through the gateway service. TLS validation will reject a connection if the hostname the client is connecting to doesn’t match a name in the certificate. This is relevant because a single TLS handshake failure cascades into completely opaque errors that look like connection refused.

Screenshot: kubectl -n test-network get pods — showing the full list of running pods

Chaincode as a Service: Why the Chaincode Runs as Its Own Pod

In traditional Fabric deployments, the peer launches chaincode directly using Docker — when you install chaincode on a peer, the peer builds a Docker image and spins up a container. This is problematic in Kubernetes because it requires the peer container to have access to a Docker daemon (Docker-in-Docker), which is complex and generally frowned upon.

The solution is Chaincode as a Service (CCaaS). Instead of the peer spawning the chaincode, the chaincode runs as its own Kubernetes Deployment and exposes a gRPC server on port 9999. The peer connects to it at a known address. The chaincode is just another pod in the cluster.

The address is specified in a connection.json file that gets bundled into the chaincode package before deployment:

{
  "address": "{{.peername}}-ccaas-chaincode:9999",
  "dial_timeout": "10s",
  "tls_required": false
}

The {{.peername}} placeholder is substituted at packaging time - so org1peer1 becomes the address org1peer1-ccaas-chaincode:9999. The peer knows exactly which Kubernetes service to connect to.

The chaincode Kubernetes deployment is generated from a template, with the chaincode name, ID, and image substituted in by the deployment script:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: org1{{PEER_NAME}}-ccaas-{{CHAINCODE_NAME}}
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: main
          image: { { CHAINCODE_IMAGE } }
          env:
            - name: CHAINCODE_SERVER_ADDRESS
              value: 0.0.0.0:9999
            - name: CHAINCODE_ID
              value: { { CHAINCODE_ID } }
---
apiVersion: v1
kind: Service
metadata:
  name: org1{{PEER_NAME}}-ccaas-{{CHAINCODE_NAME}}
spec:
  ports:
    - name: chaincode
      port: 9999

One deployment and one service per peer, per org. With two peers per org and two orgs (org1 and org2), that’s four chaincode sidecar pods in total — org1peer1-ccaas-chaincode, org1peer2-ccaas-chaincode, org2peer1-ccaas-chaincode, org2peer2-ccaas-chaincode.

The CHAINCODE_ID is the critical environment variable. It's computed as sha256(chaincode.tgz) - the hash of the packaged chaincode archive. The peer and the chaincode container must agree on this value; if they don't match, the peer refuses to talk to the chaincode container.

Writing the Chaincode: What fabric-contract-api Is

The chaincode is written in TypeScript using a library called fabric-contract-api. Before getting into what this does, it helps to understand the problem it's solving.

Fabric chaincode communicates with the peer over gRPC — a low-level binary protocol. Without a framework, you’d be implementing the gRPC server yourself, handling message serialization, managing the chaincode lifecycle protocol, and making raw putState and getState calls. It's doable but tedious and error-prone.

fabric-contract-api wraps all of that. It lets you write a TypeScript class where each method is a smart contract function. You decorate the class and its properties with @Object, @Property, @Transaction, and @Info, and the framework handles the gRPC plumbing, serialization, and lifecycle.

Here is what the booking data model looks like with these decorators:

import { Object, Property } from "fabric-contract-api";

@Object()
export class Booking {
  @Property()
  public bookingID: string = "First Booking";

  @Property()
  public userID: string = "First User";

  @Property()
  public userHash: string = "";

  @Property()
  public isUserAnonymous: boolean = true;

  @Property()
  public agencyID: string = "";

  @Property()
  public travelID: number = 0;

  @Property()
  public seatNumbers: string = "1A,1B";

  @Property()
  public totalPrice: number = 100;

  @Property()
  public transactionID: string = "";

  @Property()
  public status: string = "Confirmed";

  @Property()
  public createdAt: string = "";

  @Property()
  public updatedAt: string = "";

  @Property()
  public cancelledAt: string = "";

  @Property()
  public refundAmount: number = 0;

  @Property()
  public penalty: number = 0;

  @Property()
  public availableSeats: number = 0;

  @Property()
  public hyperledgerTxId: string = "";
}

The @Object() decorator tells the framework this class represents a ledger asset. The @Property() decorators tell it which fields to include in serialization. When you call ctx.stub.putState(bookingID, Buffer.from(JSON.stringify(booking))), this object gets serialized to JSON and written to the world state under the bookingID key. Reading it back is just ctx.stub.getState(bookingID) and parsing the result.

The contract itself uses @Transaction() to mark functions that write to the ledger (submit transactions) and @Transaction(false) for read-only queries (evaluate transactions):

import { Context, Contract, Info, Transaction } from "fabric-contract-api";

@Info({
  title: "BookingContract",
  description: "Smart contract for recording travel bookings",
})
export class BookingContract extends Contract {
  @Transaction()
  public async RecordBooking(
    ctx: Context,
    bookingID: string,
    userID: string,
    isUserAnonymous: boolean,
    // ... other fields
  ): Promise<void> {
    const booking = new Booking();
    booking.bookingID = bookingID;
    // ... assign fields
    booking.hyperledgerTxId = ctx.stub.getTxID();
    await ctx.stub.putState(
      booking.bookingID,
      Buffer.from(JSON.stringify(booking)),
    );
  }

  @Transaction(false)
  public async ReadBooking(ctx: Context, bookingID: string): Promise<string> {
    const data = await ctx.stub.getState(bookingID);
    if (data.length === 0) {
      throw new Error(`Booking ${bookingID} does not exist`);
    }
    return data.toString();
  }

  @Transaction(false)
  public async BookingExists(
    ctx: Context,
    bookingID: string,
  ): Promise<boolean> {
    const data = await ctx.stub.getState(bookingID);
    return data.length > 0;
  }

  @Transaction(false)
  public async GetAllBookings(ctx: Context): Promise<string> {
    const iterator = await ctx.stub.getStateByRange("", "");
    const bookings = [];
    let result = await iterator.next();
    while (!result.done) {
      bookings.push(JSON.parse(result.value.value.toString()));
      result = await iterator.next();
    }
    await iterator.close();
    return JSON.stringify(bookings);
  }

  @Transaction()
  public async DeleteBooking(ctx: Context, bookingID: string): Promise<void> {
    const exists = await this.BookingExists(ctx, bookingID);
    if (!exists) {
      throw new Error(`The booking ${bookingID} does not exist`);
    }
    return ctx.stub.deleteState(bookingID);
  }
}

The ctx.stub.getTxID() call inside RecordBooking is important. When Fabric commits a transaction, it assigns a unique transaction ID to it. By capturing this inside the chaincode and storing it as hyperledgerTxId in the booking record, we can later look up exactly which block this booking is in. That's what the block height endpoint does.

GetAllBookings uses getStateByRange('', '') with empty strings for both bounds - that means "all keys." It returns a cursor-based iterator rather than loading everything at once, which matters if the ledger grows large.

The Privacy Design: Hashing User Identity

There’s a field in the Booking object called userHash, and a boolean called isUserAnonymous. This needs explaining.

The Fabric ledger in this architecture is not private — all peers in the network can read all bookings. Org1 (travel agencies) and Org2 (customers) share the same channel and therefore the same ledger. If a booking stored userName: "Yuvraj Raghuvanshi" and userEmail: "yuvraj@example.com" directly in the ledger entry, then every peer operator - including the travel agencies - could read that personal information.

Bitcoin has the same problem and solves it with a hash: your identity on the Bitcoin network is a hash of your public key, not your name. I used the same idea here. By default, only a hash of the user’s identifier is written to the ledger. The actual name and email stay in the customer backend’s database, which the travel agencies can’t access. If the user explicitly opts out of anonymity (isUserAnonymous: false), their internal application userID is written instead - still not their name or email, just an opaque identifier.

The personal information lives in the application layer. The ledger records that a booking was made. If you need to verify who made it, you go through the application, not the ledger directly.

The Chaincode Lifecycle: A Five-Step Process

This is one of the parts of Fabric that confuses people most. Deploying chaincode is not like deploying a Docker image. There is a formal five-step governance process:

1. Package : Bundle the chaincode source (or in CCaaS mode, the connection.json and metadata.json) into a .tgz archive. Compute its SHA256 hash - this becomes the CHAINCODE_ID.

2. Install : Copy the package to each peer that will run the chaincode. The peer stores it locally but doesn’t activate it yet.

3. Approve : Each organization’s admin issues a vote approving the chaincode definition: this name, this version, this sequence number. In a real multi-party network, each organization does this independently. The channel’s endorsement policy determines how many approvals are needed before the chaincode can be committed.

4. Commit : Once enough organizations have approved, one admin commits the chaincode definition to the channel. This is a channel-wide operation: after commit, all peers on the channel recognize the chaincode as active.

5. Launch (CCaaS only): In CCaaS mode, after the lifecycle is complete, the chaincode container must actually be running. The peer connects to it over gRPC at the address from connection.json.

Here is the approval step in the deployment script, showing the key arguments:

peer lifecycle chaincode approveformyorg \
  --channelID mychannel \
  --name chaincode \
  --version 1 \
  --package-id chaincode:${sha256_of_package} \
  --sequence ${next_seq_num} \
  --orderer org0-orderer1.localho.st:443 \
  --tls --cafile /path/to/org0-tls-ca.pem

The --sequence argument is where I ran into a concrete problem.

The Sequence Number Bug

When you first deploy chaincode, --sequence 1 is correct. The sequence number tracks how many times the chaincode definition has been updated on the channel. First deployment: 1. First update: 2. And so on.

The original chaincode.sh from fabric-samples had --sequence 1 hardcoded everywhere - in both the approveformyorg and commit commands. This works exactly once. The reset script tears down the KIND cluster and rebuilds everything from scratch, so each reset starts fresh - which means sequence 1 is always correct after a full reset.

But Fabric also supports incremental chaincode updates without tearing down the cluster. If you install a new version of chaincode on a running network, you need to increment the sequence. With the hardcoded 1, this would fail.

The fix was to query the current committed sequence and increment it:

function get_next_sequence() {
  local channel=$1
  local cc_name=$2

  export_peer_context org1 peer1

  current_seq=$(peer lifecycle chaincode querycommitted \
    -C $channel \
    -n $cc_name \
    --output json 2>/dev/null | jq -r '.sequence') || echo 0
  echo $((current_seq + 1))
}

If the chaincode hasn’t been committed yet, querycommitted returns nothing, we default to 0, and the next sequence is 1. If it's already been committed at sequence 1, the next sequence is 2. This makes incremental updates possible without touching the cluster.

Deploying to Both Orgs

The original chaincode.sh installed and approved chaincode only for org1. This is a problem.

The reason goes back to Fabric’s endorsement policy. When a client submits a transaction to the ledger, it doesn’t go directly to one peer. It goes to multiple peers for endorsement first. Each endorsing peer executes the chaincode, signs the result, and sends it back. The client collects enough endorsements to satisfy the policy, then sends the endorsed transaction to the orderers for ordering and commit.

If the endorsement policy requires signatures from both org1 and org2 — which is the correct setup for a two-party booking system — then org2 peers must also have the chaincode installed and approved. Otherwise, they can’t endorse transactions, and the policy is never satisfied.

I knew this from reading the architecture documentation before writing a line of deployment code. The fix was to loop over both orgs everywhere:

function install_chaincode() {
  local cc_package=$1

  for org in org1 org2; do
    install_chaincode_for ${org} peer1 ${cc_package}
    install_chaincode_for ${org} peer2 ${cc_package}
  done
}

function approve_chaincode() {
  local cc_name=$1
  local cc_id=$2
  local next_seq_num=$3
  for org in org1 org2; do
    export_peer_context ${org} peer1
    peer lifecycle chaincode approveformyorg \
      --channelID ${CHANNEL_NAME} \
      --name ${cc_name} \
      --version 1 \
      --package-id ${cc_id} \
      --sequence ${next_seq_num} \
      ...
  done
}

And for the CCaaS launches, org2 also needs its own chaincode sidecar pods. This required a separate org2-cc-template.yaml with the same structure as org1-cc-template.yaml but with org2 substituted throughout. Four CCaaS pods in total: org1peer1-ccaas-chaincode, org1peer2-ccaas-chaincode, org2peer1-ccaas-chaincode, org2peer2-ccaas-chaincode.

The REST Interface: Bridging the Fabric SDK and HTTP

Fabric doesn’t expose an HTTP API. The Fabric Node SDK communicates with peers over gRPC directly. To let the application backends interact with the blockchain over HTTP, there’s a separate service (the network REST interface) that wraps the SDK in an Express server.

I forked fabric-samples/asset-transfer-basic/rest-api-typescript, which had the right structure already. It manages two long-lived gRPC connections to the network (one authenticated as an org1 identity, one as an org2 identity) and keeps them open for the life of the server. Creating new connections per request is expensive and the wrong pattern with Fabric's SDK.

The key design decision in the original sample that I kept was the async job queue. Submitting a transaction to Fabric is not instant. The request goes to a peer for endorsement, then to the orderers for ordering, then to peers for validation and commit. This can take a few seconds. A naive synchronous REST endpoint would time out.

The solution is to return 202 Accepted immediately with a job ID, and queue the transaction for background processing. The caller polls /api/jobs/:jobId to find out when it's done:

POST /api/bookings
→ 202 Accepted, { jobId: "42" }

GET /api/jobs/42
→ { status: "complete", transactionId: "abc123..." }

The queue is implemented with BullMQ, which uses Redis as a backend. Each submitted transaction is a job in the queue. A worker process picks jobs off the queue, submits them to Fabric, and writes the result back to Redis. The job status endpoint reads from Redis.

The booking router handles authentication via API keys mapped to org identities. An API key for org1 tells the server to use the org1 connection profile and sign transactions with the org1 admin certificate. The key is passed as an X-Api-Key header. Both the customer backend and the travel agency backend have their own API keys.

// From auth.ts
const apiKeyOrgs: { [key: string]: string } = {
  [ORG1_APIKEY]: "Org1MSP",
  [ORG2_APIKEY]: "Org2MSP",
};

The endpoints the booking router exposes:

GET /api/bookings - Evaluate GetAllBookings chaincode function
GET /api/bookings/:bookingID - Evaluate ReadBooking
POST /api/bookings - Submit RecordBooking (queued)
DELETE /api/bookings/:bookingID - Submit DeleteBooking (queued)
GET /api/bookings/:hyperledgerTxId/blockheight - Query block position

Reading the Blockchain: Block Height and QSCC

The assignment required that bookings be verifiable on the blockchain. The simplest form of verification is proving not just that a booking exists in the world state, but that it was committed in a specific block that has subsequent blocks built on top of it.

Fabric has a system chaincode called QSCC — Query System Chaincode. It’s a built-in chaincode that runs on every peer and exposes ledger metadata. You can query it to find which block contains a given transaction, or to get the current height of the chain.

The block height endpoint works like this: take the hyperledgerTxId stored in the booking, call QSCC's GetBlockByTxID function on the peer, and decode the returned protobuf to find the block number. Then call GetChainInfo to find the current chain height. The difference tells you how many blocks have been added since this booking was committed.

bookingsRouter.get("/:hyperledgerTxId/blockheight", async (req, res) => {
  const contract = req.app.locals[mspId]?.qsccContract as Contract;
  const hyperledgerTxId = req.params.hyperledgerTxId;

  // Ask QSCC: which block contains this transaction?
  const blockBytes = await contract.evaluateTransaction(
    "GetBlockByTxID",
    "mychannel",
    hyperledgerTxId,
  );
  // QSCC returns raw protobuf bytes - decode with fabric-protos
  const block = common.Block.decode(blockBytes);
  const blockHeight = block.header.number.toString();
  // Get current chain height
  const chainInfo = common.BlockchainInfo.decode(
    await contract.evaluateTransaction("GetChainInfo", "mychannel"),
  );
  const currentHeight = chainInfo.height.toString();
  return res.status(OK).json({
    hyperledgerTxId,
    blockHeight, // Block where this booking was committed
    blockchainHeight: currentHeight, // Current chain height
  });
});

The protobuf decoding uses fabric-protos, a package that contains the compiled protobuf definitions for all Fabric message types. common.Block.decode(blockBytes) takes the raw bytes from QSCC and gives you a structured object with header.number as the block index.

Screenshot: Webapp (Fabric REST interface) showing hyperledgerTxId, blockHeight: 5, blockchainHeight: 7 — meaning 2 blocks have been added since this booking

The Reset Script and 30 Hours of Debugging

Every configuration change, every chaincode update, every time something was broken beyond quick repair — the reset script. It tears down everything and rebuilds:

./network down # Bring down peers, orderers, chaincode
./network unkind # Delete the KIND cluster
./network kind # Create a new KIND cluster
./network cluster init # Install cert-manager, nginx, set up namespaces
./network up # Launch CAs, peers, orderers
./network channel create # Create mychannel, join peers
./network chaincode deploy chaincode chaincode/ # Full chaincode lifecycle
./network rest-easy # Build and deploy the REST interface
kubectl -n test-network port-forward svc/fabric-rest-sample 3003:3000

From scratch, this takes about fifteen minutes. I ran it a lot.

yuvraj@Windows-11:~/mytravel/hyperledger$ ./reset
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Shutting down test network "test-network":
✅ - Stopping Fabric services ...
✅ - Scrubbing Fabric volumes ...
✅ - Deleting namespace "test-network" ...
🏁 - Fabric network is down.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Deleting KIND cluster "kind":
✅ - Deleting KIND cluster kind ...
✅ - Deleting container registry "kind-registry" at localhost:5000 ...
🏁 - KIND Cluster is gone.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Creating KIND cluster "kind":
✅ - Creating cluster "kind" ...
✅ - Launching container registry "kind-registry" at localhost:5000 ...
🏁 - KIND cluster is ready
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Initializing K8s cluster
✅ - Launching kind ingress controller ...
✅ - Launching cert-manager ...
✅ - Waiting for cert-manager ...
✅ - Waiting for ingress controller ...
🏁 - Cluster is ready
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Launching network "test-network":
✅ - Creating namespace "test-network" ...
✅ - Provisioning volume storage ...
✅ - Creating fabric config maps ...
✅ - Initializing TLS certificate Issuers ...
✅ - Launching Fabric CAs ...
✅ - Enrolling bootstrap ECert CA users ...
✅ - Creating local node MSP ...
✅ - Launching orderers ...
✅ - Launching peers ...
🏁 - Network is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Creating channel "mychannel":
✅ - Registering org Admin users ...
✅ - Enrolling org Admin users ...
✅ - Creating channel MSP ...
✅ - Creating channel genesis block ...
✅ - Joining orderers to channel mychannel ...
✅ - Joining org1 peers to channel mychannel ...
✅ - Joining org2 peers to channel mychannel ...
🏁 - Channel is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Deploying chaincode
✅ - Building chaincode image chaincode ...
✅ - Publishing chaincode image localhost:5000/chaincode ...
✅ - Packaging ccaas chaincode chaincode ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Launching chaincode container "localhost:5000/chaincode" ...
✅ - Installing chaincode for org org1 peer peer1 ...
✅ - Installing chaincode for org org1 peer peer2 ...
✅ - Installing chaincode for org org2 peer peer1 ...
✅ - Installing chaincode for org org2 peer peer2 ...
✅ - Approving chaincode chaincode with ID chaincode:105d1916755525d103749c9d6245f1553cd7dc6b10be036d4cd574b050f99bf1 for org1 ...
✅ - Approving chaincode chaincode with ID chaincode:105d1916755525d103749c9d6245f1553cd7dc6b10be036d4cd574b050f99bf1 for org2 ...
✅ - Committing chaincode chaincode ...
🏁 - Chaincode is ready.
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
2026-05-01 14:59:49.508 UTC 0001 INFO [chaincodeCmd] chaincodeInvokeOrQuery -> Chaincode invoke successful. result: status:200 payload:"[]"
Fabric image versions: Peer (2.5.15), CA (1.5.19)
Fabric binary versions: Peer (2.5.15), CA (1.5.19)
Launching fabric-rest-sample application:
✅ - Constructing fabric-rest-sample connection profiles ...
✅ - Preparing the typescript REST interface ...
The fabric-rest-sample has started.
See https://github.com/hyperledger/fabric-samples/tree/main/asset-transfer-basic/rest-api-typescript for additional usage details.
To access the endpoint:
export SAMPLE_APIKEY=97834158-3224-4CE7-95F9-A148C886653E
curl -s --header "X-Api-Key: ${SAMPLE_APIKEY}" http://fabric-rest-sample.localho.st/api/assets
🏁 - Fabric REST sample is ready.
Forwarding from 127.0.0.1:3003 -> 3000
Forwarding from [::1]:3003 -> 3000

Somewhere in rest_sample.sh there is this function:

# This magical awk script led to 30 hours of debugging a "TLS handshake error"
# moral: do not edit / alter the number of '\' in the following transform:
function one_line_pem {
    echo "`awk 'NF {sub(/\\n/, ""); printf "%s\\\\\\n",$0;}' $1`"
}

This converts a multi-line PEM certificate file into a single-line string, which can be embedded in the JSON connection profile that the REST interface uses to connect to the peers. PEM files look like this:

-----BEGIN CERTIFICATE-----
MIICnTCCAkSgAwIBAgIUHqVnDpJd...
-----END CERTIFICATE-----

The JSON connection profile needs the certificate as a single string with literal \n characters instead of actual newlines. The awk script does that conversion. The number of backslashes in printf "%s\\\n" is not a mistake - it's exactly what's needed to survive multiple layers of shell interpretation (awk's string parsing, the outer shell's variable interpolation, and then the final JSON embedding).

I found it on Stack Overflow. The comment saying not to edit was already there. I ignored the comment. At some point while trying to understand what the function did, I adjusted the backslashes. The resulting connection profile looked syntactically fine (valid JSON, readable PEM string) but the embedded certificate was subtly malformed when parsed by the TLS library. The peers rejected connections with a generic TLS handshake error. Nothing in the error message pointed to the certificate content.

Thirty hours later I found the diff, restored the original function, and the network came back up immediately.

The lesson I took from this is specific: when you copy a piece of code that works and the original author has left a warning comment, take the comment more seriously than you take your own curiosity.

What It Feels Like to Develop with Fabric

Hyperledger Fabric is not built for rapid iteration. The formal chaincode lifecycle (package, install, approve, commit) exists for legitimate reasons in a real multi-organization network, where independent organizations need to independently audit and approve changes to shared business logic before those changes take effect. In that context, the process is the point.

In a student project with one developer and a fifteen-minute reset cycle, the friction is harder to appreciate. But some of the design choices still made genuine sense even at this scale.

The privacy model was the clearest one. Booking records on a distributed ledger are readable by every peer operator in the network. Storing userName: "Alice" and userEmail: "alice@example.com" directly in those records was obviously wrong. The user hash approach - borrow the idea from Bitcoin, keep personal data in the application layer, put only an opaque identifier on the chain - is the correct design regardless of whether you're building a student project or a production system.

The block height endpoint also felt worth building properly. Returning a booking record from the world state proves the booking exists now. Returning the block number and the current chain height proves when it was committed and that the chain has grown since then, making the record progressively harder to retroactively alter. That’s what blockchain verification actually means, and it’s different from just having a database record.

The rest of it (the endorsement policy, the CA infrastructure, the Raft consensus cluster) was mostly infrastructure I set up correctly and then tried not to touch. Which, given the awk script experience, is probably the right approach.

YuvrajRaghuvanshiS / mytravel

Single monorepo for the MyTravel project

MyTravel.com - Blockchain-Based Ticket Booking Platform

MyTravel.com is a comprehensive ticket booking system that leverages blockchain technology to ensure secure, transparent, and immutable transaction records. This platform integrates traditional web application architecture with Hyperledger Fabric blockchain infrastructure, providing a hybrid web2-web3 solution for customers and travel agencies. The system enables real-time booking management, dynamic pricing, and decentralized transaction verification while maintaining user privacy through anonymized blockchain interactions.

Project Architecture Overview

The platform follows a modular microservices architecture with four core components:

1. Customer Backend Service (Node.js/Express)

Handles customer-facing operations including:

User registration and JWT-based authentication
Travel listing discovery and filtering
Booking management with blockchain verification
Digital wallet operations
Profile management with anonymous mode support

2. Travel Agency Backend Service (Node.js/Express)

Manages agency-specific functionalities:

Agency registration and authentication
Travel route creation/updation
Seat inventory management
Booking reconciliation
Financial settlements

3. React Frontend Application

Provides unified user interface for:

Customer booking flow
Agency…

View on GitHub

This article is rewritten using AI chatbots.

I Told My Friend I’d Hack WhatsApp. Then I Actually Did It.

Yuvraj Raghuvanshi — Sun, 03 May 2026 12:25:15 +0000

A friend asked me if I could “hack” WhatsApp. I was bored, so I said yes.

I didn’t break any encryption. I didn’t exploit a server. What I found was something more interesting: a door that was left open inside Android itself, and an old version of WhatsApp that was still politely holding it open years after everyone else had moved on.

The result was a Python tool that extracts WhatsApp’s encryption key and message database from a phone (without root access) by exploiting the Android backup system. It picked up 540 stars on GitHub. It also got me a message from a developer trying to help people migrate from WhatsApp to Signal right before WhatsApp changed its terms of service to share data with Facebook. That part felt like it actually mattered.

This is how it works, from the ground up.

How Android Stores App Data

To understand the trick, you first need to understand how Android isolates apps from each other.

Every Android app runs in its own sandbox. This is not metaphorical — it’s enforced at the Linux kernel level. When an app is installed, Android creates a dedicated Linux user for it. WhatsApp might be u0_a123. Your banking app might be u0_a201. These are real Unix users with real UIDs. Because they're different users, they cannot read each other's files. The file permissions are enforced by the kernel, the same way they are on any Linux system.

The directory where an app’s private data lives is /data/data/. For WhatsApp, that's /data/data/com.whatsapp. Inside it, you'll find subdirectories that look like this:

/data/data/com.whatsapp/
    databases/
        msgstore.db ← your messages (plaintext SQLite)
        wa.db ← contacts
    files/
        key ← the encryption key
    shared_prefs/
    cache/
    lib/

/data itself is on a partition that is not accessible to normal users or normal processes. On a rooted phone, you can adb shell in and read it directly because root bypasses the permission system. On a non-rooted phone, you cannot. The files are there (they're on the storage) but the kernel will refuse every read attempt from any process that isn't WhatsApp itself.

This is the wall that everyone trying to extract WhatsApp data runs into.

Screenshot: adb shell showing “Permission denied” when trying to ls /data/data/com.whatsapp without root

The Key and the Database

Before going further, it’s worth explaining what these two files actually are and why you need both of them.

msgstore.db inside the sandbox at /data/data/com.whatsapp/databases/ is a plain, unencrypted SQLite database. WhatsApp works with it directly - reading and writing your messages in cleartext while the app is running. SQLite is a well-understood format; tools to read it are everywhere.

What you see if you browse your phone’s SD card storage is different: msgstore.db.crypt14, stored at /sdcard/WhatsApp/Databases/. This is the encrypted backup copy that WhatsApp writes to external storage so Google Drive can sync it. The number in the extension (crypt12, crypt14) indicates the encryption scheme version. This external copy is encrypted with AES-256 and is unreadable without the key.

The key file lives at /data/data/com.whatsapp/files/key. It's a binary file containing key material that WhatsApp generates based on private factors specific to your account and device. WhatsApp doesn't document the generation process publicly - we only know it exists because the file exists, and because without it, the crypt14 file is unreadable garbage.

Here’s what makes the key particularly valuable: it isn't rotated. Once you have it, you can use it to decrypt any crypt14 backup from that account (past ones, current ones, and future ones) until WhatsApp generates a new key. This is different from, say, TLS session keys that are discarded after use. The same key file that works today will work on a database backup taken six months ago, and on one taken six months from now.

The problem is getting the key. It lives inside the sandbox. Root gives you a sledgehammer to break through it. But there’s a quieter way.

What adb backup Is

Android Debug Bridge (adb) is a command-line tool included in the Android SDK. It lets a connected computer communicate with an Android device for development purposes: install APKs, run shell commands, read logs, transfer files. When you enable USB Debugging on your phone, you're enabling adb.

One of the things adb could do, from Android 4.0 onwards, was create full app backups:

adb backup -f myApp.ab -apk com.foobar.app

This would back up the app’s APK and its private data directory to a file on your computer. The backup system was designed for legitimate use: before switching phones, before a factory reset. It needed to read files from /data/data/ (which normally only the app itself can do) so it ran with elevated OS-level privileges to do so.

This is the door. The backup system could read app sandboxes. The question was whether it was allowed to for any given app.

AndroidManifest.xml and the allowBackup Flag

Every Android app ships with a file called AndroidManifest.xml. This is the app's declaration to the operating system - it lists the app's package name, the permissions it needs, the activities it contains, and dozens of other properties. Android reads this file when installing an app and uses it to configure how the app is treated by the system.

One of those properties is android:allowBackup. It controls whether adb backup is permitted to include this app's data in a backup. The default value, historically, was true - if you didn't specify it, backups were allowed.

When WhatsApp realised that this flag meant anyone with USB debugging access could extract their users’ messages with a single command, they set it to false in their manifest. After that point, running adb backup com.whatsapp would produce a backup file, but it would be empty. The backup system would see allowBackup="false" in the manifest and skip the data entirely.

Here is what that flag looks like inside a decompiled WhatsApp APK manifest:

<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.whatsapp"
    android:versionCode="452658"
    android:versionName="2.21.1.1">

    <application
        android:label="@string/app_name"
        android:icon="@drawable/ic_launcher"
        android:allowBackup="false" ← current versions: explicitly denied
        ...>

Modern WhatsApp has allowBackup="false". But old versions of WhatsApp (specifically versions from before WhatsApp realised this was a problem) had allowBackup="true", or simply didn't specify it at all, defaulting to permitted.

The version used in this tool is v2.11.431 , from around 2015. At that version, the manifest permits backups. The Android backup system doesn't know it's running an old version. It reads the manifest, sees the flag, and opens the data directory.

The Trick: Uninstall (Keeping Data), Backup, Restore

The loophole requires three steps: uninstall current WhatsApp while keeping its data intact, install the old version that allows backups, run the backup, then reinstall current WhatsApp.

That first step (“uninstall while keeping data”) deserves explanation, because it’s not what most people think of when they hear “uninstall.”

Android normally does two things when you uninstall an app: it removes the APK, and it wipes /data/data/. Your messages, settings, everything - gone. This is the standard uninstall. But adb exposes a flag that separates these two operations:

adb shell pm uninstall -k com.whatsapp

The -k flag means "keep data." The APK is removed. The data directory at /data/data/com.whatsapp is left completely untouched - key, databases, shared preferences, all of it still there, owned by a UID that no longer has an app attached to it.

This is the correct way to do the downgrade. Android does not allow installing an older version of an app directly over a newer one — it checks version codes and refuses with INSTALL_FAILED_VERSION_DOWNGRADE. In-place downgrade is blocked. But if the app is already uninstalled (even with -k), there's nothing to compare version codes against. The legacy APK installs cleanly. Then when Legacy WhatsApp starts, it finds the existing data directory (the one that current WhatsApp left behind) and picks up from exactly where it was. The key file is there. The databases are there. Legacy WhatsApp doesn't care that the data was written by a newer version.

So the sequence is:

Current WhatsApp is installed on the device. Its data (key, msgstore.db) is in /data/data/com.whatsapp.
Uninstall current WhatsApp with adb shell pm uninstall -k com.whatsapp. The APK is gone; the data directory survives.
Install Legacy WhatsApp v2.11.431 via adb install. It finds the existing data directory and inherits it.
Run adb backup com.whatsapp. Legacy WhatsApp's manifest says allowBackup="true", so the backup system reads the full data directory and writes it to whatsapp.ab on the computer. This includes the key file and the plaintext msgstore.db.
Uninstall Legacy WhatsApp. Reinstall current WhatsApp. The data directory is still there. WhatsApp opens normally.

The user ends up with their current WhatsApp running unchanged, and the computer has a copy of the key and msgstore.db.

What Happens on the Phone Screen During Backup

When adb backup is triggered, Android shows a system prompt on the phone that the user must explicitly interact with. This is a security measure - silent background backups are not allowed.

Screenshot: the “Full backup” system dialog — the image uploaded above showing password field and “Back up my data” button

The dialog says: “A full backup of all data to a connected desktop computer has been requested. Do you want to allow this to happen? If you did not request the backup yourself, do not allow the operation to proceed.”

The user taps “Back up my data.” The backup begins.

There’s also a password field. If the user enters a password here, the backup archive is encrypted with AES before it’s written to disk. The script asks the user on the terminal whether they want to set a backup password. If they do, they need to type the same password into both places — the terminal (so the script knows it for decryption later) and the phone screen (so Android encrypts with it). If the passwords don’t match, the extracted backup will be unreadable.

This is also why one of the troubleshooting notes says: “If you have set a default backup password in your Android settings, then this MUST be the backup password that you PROVIDE when prompted.” Some users have a system-wide default backup password set in their Android settings that they’ve forgotten about. Android silently uses it to encrypt the backup. The script then can’t decrypt it because it received a different password.

The .ab File Format

adb backup produces a .ab file - Android Backup. This is not a zip or a standard archive format. Its structure is documented because Android is open source, so the format is known.

The file begins with a plaintext header:

ANDROID BACKUP
4
1
none

These four lines tell you: this is an Android backup file, format version 4, compressed (1 = yes), encryption algorithm (none, or AES-256 if a password was set). After the header, the rest of the file is a zlib-compressed tar archive. Inside the tar is the app’s data directory, with paths like:

apps/com.whatsapp/f/key
apps/com.whatsapp/db/msgstore.db
apps/com.whatsapp/db/wa.db

The f/ prefix corresponds to the files/ directory in the app's data, and db/ to databases/. The tar contains the full directory structure, just with these path prefixes.

To unpack this, the tool uses android-backup-extractor — a small open source Java utility that understands the .ab header, handles the optional AES decryption, decompresses the zlib payload, and extracts the tar. I did try to rewrite this part in Python. The unencrypted case is straightforward - strip the header, decompress with zlib, untar. But when a password is involved, the AES key derivation that Android uses involves specific parameters and byte-level handling that I couldn't get right. Rather than ship a broken decryptor, I kept the Java dependency. The android-backup-extractor handles it correctly.

java -jar abe.jar unpack whatsapp.ab whatsapp.tar [password]

After that, the tar is standard:

tar -xf whatsapp.tar

And the key file and databases are on disk, readable.

Installing the Legacy Version: INSTALL_FAILED_VERSION_DOWNGRADE

The first attempt most people make is to just install the legacy APK directly:

adb install LegacyWhatsApp.apk

Android refuses. It checks the version code of the APK being installed against the already-installed app. If the incoming version code is lower, it rejects with INSTALL_FAILED_VERSION_DOWNGRADE. Downgrade over an existing installation is blocked.

The correct approach is to uninstall current WhatsApp first (using the -k flag to preserve the data directory) and then install the legacy APK cleanly. With no existing installation to compare against, Android has no version code conflict to enforce.

The --allow-reboot flag handles a related but separate problem. On some devices, even after uninstalling with -k, the installation fails for other reasons. Rebooting before the install clears whatever state was causing the refusal. The exact mechanism is in the "if it works, don't touch it" category - the device is rebooted via adb reboot before the legacy APK is installed, and on devices where it was failing, it stops failing. The most likely explanation is that some manufacturers' Android builds check additional conditions at runtime that aren't re-evaluated immediately after boot.

There’s a related issue on MIUI (Xiaomi’s Android skin): adb install is blocked by a separate setting ("Install via USB") in Developer Options, distinct from USB Debugging. Without it, every install attempt fails with INSTALL_FAILED_USER_RESTRICTED, regardless of version codes.

Running Legacy WhatsApp Before the Backup

Issue #16 in the repository documents one of the more interesting device-specific behaviours encountered. On some devices, the backup would run without errors but produce a nearly empty archive — no key, no database.

The cause turned out to be that Legacy WhatsApp hadn’t been launched even once before the backup was taken. On those devices, Android’s backup system only includes app data that has been “activated” — meaning the app has run at least once since installation. Without that first launch, the data directory exists (carried over from the current WhatsApp installation), but the backup hook reports nothing to back up.

The fix was to launch Legacy WhatsApp briefly before triggering the backup. The script does this, with a prompt asking the user to open the app and let it sit for a few seconds before continuing. It’s the kind of fix that makes no sense until you see the behaviour it’s correcting.

Approximately 90% of the issues filed on this project follow the same pattern: something fails silently, the fix involves a step that has no obvious causal relationship with the problem, and once you add the step the issue disappears. Android’s backup system is not well-documented in its edge cases.

ADB Over TCP: Using the Tool Without a USB Cable — and Seeing the Screen

Normal adb communication happens over USB. The tool also supports ADB over TCP - connecting to a device over Wi-Fi instead.

python3 wa_kdbe.py --tcp-ip 192.168.43.130 --tcp-port 5555

Android has had ADB-over-TCP support built in since early versions. Once enabled (either through Developer Options on some devices, or by first connecting via USB and running adb tcpip 5555), the device listens for ADB connections on port 5555 over the local network.

The practical use cases: a broken USB port, a device on the other side of a room, or a device anywhere on the same network as the computer. The --tcp-ip flag accepts any IP address, including a phone connected via mobile hotspot. The tool works identically over TCP as over USB; the ADB protocol doesn't care about the transport layer.

When running in TCP mode, there’s an additional flag: --scrcpy. This uses scrcpy (an open source tool by Genymobile) to mirror the phone's screen to a window on the computer and allow touch input via mouse and keyboard. In USB mode this is less necessary since the device is physically in hand, but over TCP the phone might be in another room. With --scrcpy, the backup dialog that appears on the phone screen (the one asking the user to tap "Back up my data") can be interacted with directly from the computer. No need to walk over to the device.

python3 wa_kdbe.py --tcp-ip 192.168.43.130 --tcp-port 5555 --scrcpy

Both features were added because they could be added, and because removing the USB requirement while adding screen control made the tool work from genuinely anywhere on the same network.

The Signal Developer Who Reached Out

A few weeks after the project picked up traction, I received a message from a developer named Sam:

“I want to make it easy for people to switch from WhatsApp to Signal, which I think is important, especially given the upcoming changes to the TOS of WhatsApp [sharing all user data with Facebook from February 8]. I’ve forked the Signal Android App and added a WhatsApp import functionality, that migrates your existing WhatsApp threads from msgstore.db to Signal. However, the process of retrieving msgstore.db is way too complicated for many people and most people don’t want to root their phone.”

He had built a Signal fork that could import WhatsApp’s msgstore.db directly. The blocking problem was that getting msgstore.db off a non-rooted phone was too technically involved for most users. He was asking if the tool could be adapted to make the extraction seamless enough for non-technical users.

This is the part that reframes what the project actually was. What started as me being bored and proving something to a friend turned out to be infrastructure for something with a real privacy rationale: helping people leave a platform that had just announced it would share their data with one of the largest advertising companies in the world, and take their message history with them.

The msgstore.db that the tool extracts (once decrypted with the key) is a standard SQLite database. Every message is in there. You can read it with any SQLite browser, write queries against it, import it into other applications. The data is yours. It exists on your device. The only thing standing between you and it was the sandbox, the allowBackup flag, and the question of which version of WhatsApp was currently installed.

Why It No Longer Works Reliably

I have marked this tool as “NO LONGER MAINTAINED.” This needs context.

The allowBackup loophole in WhatsApp v2.11.431 hasn't been patched in the sense that the old APK was changed - the APK still has allowBackup="true". But Android itself has progressively made downgrade installations harder and backup extraction more restricted.

Newer Android versions (11, 12, 13+) have tightened the backup system significantly. Google restricted what adb backup could access, eventually deprecating the feature entirely in higher API levels. The backup dialog still appears, the process runs, but on many modern Android builds the data directories are excluded by the OS regardless of what the app manifest says. The flag that WhatsApp forgot to set stopped mattering because Android stopped honouring it.

The tool works on older Android versions and some devices where the restrictions haven’t fully landed. On others (most new devices running Android 12 or 13) it will produce an empty or near-empty backup. The backup system is still there; it just doesn’t open the data directories anymore.

I moved on when it became clear that the percentage of devices where it worked reliably was shrinking with every Android release, and that fixing it would require a different approach entirely — one that likely involves ADB shell commands and run-as, which has its own restrictions and device-specific behaviour.

What the Tool Actually Does, End to End

For completeness, here is the full sequence the script runs:

Connect to the device via USB or TCP. Verify adb can see it.
Back up the current WhatsApp APK from /data/app/com.whatsapp/ to /data/local/tmp/WhatsAppbackup.apk on the device. This is the user's currently installed version, saved so it can be reinstalled later.
Uninstall current WhatsApp with adb shell pm uninstall -k com.whatsapp. The -k flag keeps the data directory intact.
Install Legacy WhatsApp v2.11.431 via adb install. It finds the existing data directory and inherits it.
Prompt the user to open Legacy WhatsApp briefly on the phone, then return to the terminal.
Run adb backup com.whatsapp. The system dialog appears on the phone. The user taps "Back up my data."
Receive whatsapp.ab on the computer.
Unpack with android-backup-extractor : java -jar abe.jar unpack whatsapp.ab whatsapp.tar.
Extract the tar. Locate key and msgstore.db (and other databases).
Copy files to extracted//.
Optionally compress the folder as a password-protected 7z archive.
Uninstall Legacy WhatsApp. Reinstall the backed-up current WhatsApp APK.
Copy msgstore.db back to the phone's SD card for convenience.

The reinstallation in step 11 uses the APK that was saved in step 2 (the user’s exact current version) rather than downloading from the Play Store. This matters because the Play Store version might have updated since the backup was taken, and using an intermediate version could cause issues with the database format.

The whole process takes a few minutes. The window where the user’s WhatsApp is replaced by the legacy version is the longest part — the backup can take a while depending on database size. If something goes wrong during reinstallation and WhatsApp refuses to open, the recovery is a clean reinstall from the Play Store followed by restoring from the local or Google Drive backup that the README tells you to take before starting.

GIF: Screen recording of complete run of the tool

That warning at the top of the README (“Hope for the best, prepare for the worst”) is there for a reason.

Project URL: github.com/YuvrajRaghuvanshiS/WhatsApp-Key-Database-Extractor

This article is for educational purposes — for understanding how Android’s app sandbox and backup system work and how a single overlooked flag in a manifest file can undermine both. Use it on your own data.

This article is rewritten using AI chatbots.

May 03, 2026

The Website That Looked Like It Needed Selenium (But Didn’t)

Yuvraj Raghuvanshi — Thu, 30 Apr 2026 10:33:32 +0000

For my thesis I needed a large corpus of Hindi poetry. Hindwi is one of the better maintained Hindi literature archives on the internet. Thousands of poems, hundreds of poets, content spanning from the 8th century to contemporary writers. It had everything I needed.

I didn’t plan to spend much time on the scraper. Collect the data, move on.

That didn’t happen.

The Obvious Problem

Visit hindwi.org/poets and you’ll see a listing of poets. Scroll down and more appear. Visit an individual poet’s page and the same thing happens — poems load as you scroll. This is the pattern that makes every scraper writer reach for Selenium almost reflexively. The content isn’t in the initial HTML. JavaScript is loading it dynamically. You need a browser.

So I set up Selenium. Headless Chrome, scroll simulation, wait for elements to appear, extract content. It worked. It was also agonizingly slow.

The real problem wasn’t just speed — it was that Selenium is fundamentally impractical to parallelize. You can’t easily spin up ten browser instances and scrape ten poets simultaneously the way you can with threads making HTTP requests. Each browser instance carries its own rendering engine, memory space, and JavaScript runtime. The resource cost compounds quickly, and the coordination between instances is a nightmare. Even with aggressive parallelism, back-of-envelope math on 25,000+ poems made it clear this would take days, not hours.

There had to be a better way.

Ten Minutes in DevTools

Before writing any more Selenium code, I opened the browser DevTools Network tab and watched what actually happened when the page loaded more content.

This is always worth doing before committing to browser automation. Dynamic-looking behavior on the frontend is still, at the network level, just HTTP requests. The browser has to get the data from somewhere. The question is whether that somewhere is directly reachable.

On Hindwi, when you scroll to the bottom of the poets listing, the browser fires a request like this:

https://www.hindwi.org/PoetCollection?lang=2&pageNumber=2&Info=poet
&StartsWith=&keyword=&typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
&TypeSlug=poets&contentFilter=&_=1777462454692

Plain GET request. No authentication tokens in the body, no encrypted signatures, no WebSocket handshake. Just query parameters. The _=1777462454692 at the end is a cache-busting timestamp the browser adds automatically - the server doesn't validate it, so scrapers can ignore it entirely.

The response that came back was raw HTML — not JSON, not XML. Just HTML cards containing poet names, dates, and profile links, ready to be injected into the DOM. So the website wasn’t serving a proper API, but it was serving something structured, paginated, and directly reachable over plain HTTP.

Screenshot: DevTools Network tab showing the /PoetCollection request and its HTML response body

The next question was: how does the browser know what URL to request for page 3, page 4, page 5? Where does that information come from?

The URL Was Sitting Right There

I looked at the page source. And there they were — all of them, already embedded in the initial HTML response:

<div class="contentLoadMore">
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=3&Info=poet
                  &StartsWith=&keyword=&typeID=659186cb-44e7-4d94-8b1a-fc70f939a733
                  &TypeSlug=poets&contentFilter=">
        <svg class="screenLoader" ...></svg>
    </div>
</div>

The site pre-embeds the URLs for every subsequent page inside data-url attributes on div.contentLoadMorePaging elements. The JavaScript reads these attributes and fires the requests when you scroll into view. But from a scraper's perspective, the URLs are already there in the first response - you don't need to scroll anything. You just parse them out and fetch them directly.

This was the moment Selenium became irrelevant.

What looked like dynamic JavaScript-driven content was really just a simple pattern: fetch the initial page, extract the hidden data-url values, make those HTTP requests directly. No browser. No scroll simulation. No waiting for DOM mutations.

Screenshot: page source with the contentLoadMorePaging div and data-url attribute visible

The Same Pattern, Everywhere

Once I knew what to look for, I checked the individual poet pages. Same pattern. A poet with more than 50 poems (Mona Gulati, for example) has this in her initial page response:

<div class="contentLoadMore">
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=2&info=ghazals
                  &SEO_Slug=kavita&Id=34074990-5be7-43e9-8a85-6aaa0be4833c
                  &Info=ghazal&StartsWith=a&typeID=659186cb-...
                  &contentType=kavita&sort=popularity-desc&filter=">
    </div>
    <div class="contentLoadMorePaging" 
         data-url="/PoetCollection?lang=2&pageNumber=3...">
    </div>
</div>

Both page 2 and page 3 are listed upfront in the first response. The site hands you the complete roadmap immediately. Fetch once, and you know exactly what to fetch next — no interaction, no scrolling, no waiting.

This held for dohas, quotes, and every other content type on the site. The contentLoadMorePaging pattern was consistent across all of Hindwi. Understanding it once meant the whole site was open.

Turning the Insight Into Code

The scraper that came out of this is conceptually simple. For the poet listing, hit the /PoetCollection endpoint and keep incrementing pageNumber until you get an empty response:

def _get_paginated_poet_cards(self, info, extra_params=None):
    page = 1
    while True:
        params = {"lang": 2, "pageNumber": page, "Info": info}
        if extra_params:
            params.update(extra_params)

        soup = get_soup(POETS_ENDPOINT, params=params)
        cards = soup.select("div.poetColumn")
        if not cards:
            break
        yield from cards
        page += 1

For poem lists, fetch the poet’s kavita page, parse whatever poems are already in the initial HTML, then extract and follow every data-url:

def _extract_poem_metadata(self, kavita_url):
    soup = get_soup(kavita_url)
    poems = self._parse_poem_list(soup)

    pagination_divs = soup.select("div.contentLoadMorePaging[data-url]")
    seen_urls = set()
    for div in pagination_divs:
        data_url = div.get("data-url")
        if not data_url or data_url in seen_urls:
            continue
        seen_urls.add(data_url)
        full_url = urljoin("https://www.hindwi.org", data_url)
        paginated_soup = get_soup(full_url)
        poems.extend(self._parse_poem_list(paginated_soup))
    return poems

No browser. No scroll events. Two BeautifulSoup calls per paginated poet.

One thing worth mentioning about _parse_poem_list: the initial page and the dynamically loaded fragment pages use different CSS classes for their poem cards. The initial listing uses div.rt_contentBodyListItems, while the paginated HTML fragments come back using div.contentListItems.nwPoetListBody. I caught this when certain poets were returning suspiciously fewer poems than their profile pages suggested - the paginated content was being silently skipped because the selector only matched the first class. A multi-selector handles both:

cards = soup.select(
    "div.rt_contentBodyListItems, div.contentListItems.nwPoetListBody"
)

This is exactly the kind of thing that produces wrong results silently. No error, no exception — just a poem count that’s quietly lower than it should be.

Screenshot: terminal output showing a poet being processed with their correct poem count

Extracting the Poems

Each poem lives on its own URL. The page serves the text in Devanagari and, for many poems, a Romanized transliteration toggled by a button. In the HTML, both versions are already present — just hidden or shown depending on which toggle is active:

# Devanagari
hindi_div = soup.find("div", {"class": "pMC", "data-roman": "off"})

# Romanized
roman_div = soup.find("div", id="HindwiRoman")
roman_pmc = roman_div.find("div", {"class": "pMC", "data-roman": "on"})

The text itself is structured as

tags containing tags per word or phrase. Joining the spans within each paragraph gives one line:

for p in hindi_div.find_all("p"):
    line = " ".join(span.get_text(strip=True) for span in p.find_all("span"))
    if line.strip():
        hindi_lines.append(line)

Both versions get saved as separate plain text files. Not every poem has a Romanized version, so the code returns None for the roman field when it doesn't exist rather than an empty list - preserving the distinction between "no Roman version" and "Roman version is blank."

Screenshot: a poem page on Hindwi showing the Devanagari text alongside the Roman toggle

Concurrency — The Real Payoff

With Selenium out of the picture, threading became trivial. The poem scraper processes all poets concurrently with a thread pool:

def scrape_poems(self, max_workers=10):
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(self._process_poet, poet, i, total)
                   for i, poet in enumerate(self.poets)]
        for future in as_completed(futures):
            future.result()

Ten threads making lightweight HTTP requests is nothing. This is what was completely impractical with Selenium — ten browser instances would have needed a dedicated server to run without thrashing. Ten requests threads ran fine on a laptop, barely registering on the CPU.

Every request goes through a shared get_soup wrapper that enforces a 1-second politeness delay and retries with exponential backoff on failures. Errors at any level - a single poem, an entire poet - get logged and skipped rather than crashing the thread. The run completed cleanly over about two hours. A small number of URLs consistently returned server errors and landed in the log; everything else went through without issue.

The Result

Two hours. 25,000+ poems across hundreds of poets. Devanagari and Romanized versions where available. Structured metadata including titles, URLs, slugs, and categories per poem. Around 300MB of text in total.

The dependency list tells the whole story:

beautifulsoup4==4.13.4
requests==2.32.4

No Selenium, no browser drivers, no Playwright, no headless Chrome. Just HTTP requests and HTML parsing.

What I Took From This

The instinct to reach for Selenium when you see dynamic content is understandable — it’s the safe default that definitely works. But dynamic content loading just means the browser is making HTTP requests after the initial page load. Those requests go somewhere, return something, and in most cases can be replicated directly.

The contentLoadMorePaging pattern on Hindwi is a good illustration of how often websites like this are more accessible than they appear. The site wasn't hiding anything. It was handing out pagination URLs in plain HTML, sitting in data-url attributes, ready to be read. JavaScript happened to be the first thing reading them - until a scraper was.

Ten minutes in the Network tab before writing any scraping code is almost always worth it. In this case, it was the difference between days of Selenium pain and a two-hour requests script that finished before lunch.

This article is for educational purposes — all ethical considerations have been addressed, including measures such as rate limiting and conducting scraping during periods of low website traffic.

This article is rewritten using AI chatbots.

April 30, 2026

Training a Classifier on Huge Dataset When RAM Is Not Your Friend

Yuvraj Raghuvanshi — Mon, 13 Apr 2026 18:05:03 +0000

I didn’t set out to build a custom data loader. I set out to train a model on the Quick, Draw! dataset.

The data pipeline was supposed to be the boring part — the few lines you write before the interesting work starts. It ended up being most of the work, the source of the most frustrating bugs, and, in retrospect, the most interesting engineering decision of the whole project.

This is the story of why I ended up with a directory containing millions of individual .npy files, and why that turned out to be the right call.

What Quick, Draw! Actually Is

Quick, Draw! is a Google dataset of human drawings collected from a browser game where players had 20 seconds to draw a prompted word. It has 345 categories — cats, airplanes, zigzags, The Eiffel Tower — with up to 100,000 drawings per class. That’s about 50 million drawings in total.

What makes it interesting for ML, and annoying for data pipelines, is that each drawing has two representations:

Raster images — each drawing rendered as a 28×28 grayscale bitmap, stored as a flat array of 784 values. These come in .npy files where a single file for one class contains an array of shape (N, 784). For 100,000 samples, that's 100,000 rows of 784 floats per file.

Stroke sequences — the original drawing data: a sequence of (dx, dy, pen_state) triplets representing how the pen moved. These come in .npz files, split into train, val, and test keys. The stroke data varies in length per drawing - a simple zigzag might have 10 points, a detailed drawing of The Great Wall of China might have hundreds.

The model I wanted to build was multimodal: it would take both representations as input simultaneously, letting a CNN process the image and an LSTM process the stroke sequence, then merge their outputs for classification. Which meant the pipeline had to serve both modalities in sync, for every sample, across 345 classes.

Screenshot: a sample drawings from Quick, Draw! — both the raster image and stroke visualization side by side

The Naive Approach and Why It Dies

The obvious first attempt is the one-liner:

data = np.load("cat.npy") # shape: (~100000, 784)

That loads fine for one class. You run it for a few classes, you’re still fine. Then somewhere around class 20 or 30 your process gets killed by the OOM killer, or your Jupyter kernel crashes silently, or the remote server you’ve SSH’d into drops your connection and takes your training run with it.

With 345 classes at 30,000 samples each (my chosen limit) — we’re talking about loading roughly 10 million samples into RAM at startup. At around 11% of a 128GB server’s memory for 10,000 samples per class, the math on 30,000 samples gets uncomfortable fast. And that’s before you account for the stroke data.

The real problem isn’t just peak RAM usage. It’s that loading everything upfront means you can’t start training until loading finishes, the loaded arrays stay resident for the entire run, and any shuffle operation has to work over the full dataset in memory. All of this compounds.

There’s also a subtlety with the stroke files: they come pre-split into train/val/test partitions. If you want to do your own splits (which you do, so you can control the ratio and the random seed), you need to recombine them first and re-split yourself.

So before we get to the loader itself, there are three preprocessing steps to run.

Step 1: Downloading the Data

The download script fetches both file types from Google’s Cloud Storage. The listing endpoint returns XML, which the script parses to find the URLs for the classes you’ve defined in base_classes. Downloads run in parallel using a thread pool:

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(lambda url: download_file(url, download_folder), file_urls)

Two separate calls — one for .npy raster files, one for .npz stroke files, filtered to the sketchrnn/ prefix:

download_quickdraw_files(..., file_type="npy")
download_quickdraw_files(..., file_type="npz", prefix_filter="sketchrnn/")

Files that already exist are skipped, which matters when you’re running this on a remote server where connections drop and you have to restart.

Screenshot: terminal output during download — the [DOWNLOAD] and [SKIP] lines showing parallel fetching

Step 2: Recombining the Stroke Splits

Each stroke .npz file has three keys: train, val, and test. Left as-is, you're working with a subset of the available data. The fix is to concatenate them:

combined = np.concatenate([data["train"], data["val"], data["test"]], axis=0)
np.savez_compressed(out_path, strokes=combined)

This runs in parallel across all classes using ProcessPoolExecutor. One thing worth noting: after combining, gc.collect() is called explicitly. This is a multiprocessing context and Python's garbage collector doesn't always release memory between processes the way you'd expect. Without this, a machine with moderate RAM will start sweating as dozens of processes hold combined arrays simultaneously.

Step 3: The Key Idea — One File Per Sample

This is the decision everything else depends on.

Instead of keeping each class as a single large .npy file, we explode every sample out into its own file:

dataset/processed/
  images/
    cat/
      000001.npy ← shape: (28, 28, 1)
      000002.npy
      ...
  strokes/
    cat/
      000001.npy ← shape: (130, 3)
      000002.npy
      ...

The conversion script loops over every class, loads the class-level arrays, preprocesses each sample, and saves them individually. The index is global across all classes — not per-class — which is what keeps image and stroke files aligned:

global_idx = 0
max_samples_per_class=100_000

for label_name, _ in LABEL_MAP.items():
    images = np.load(img_path, mmap_mode="r") # note: memory-mapped
    strokes = np.load(
                stroke_path, allow_pickle=True, encoding="latin1"
              )["strokes"]

    N = min(len(images), len(strokes), max_samples_per_class)

    for i in range(N):
        idx = global_idx + i
        np.save(
          f"images/{label_name}/{idx:06d}.npy", 
          preprocess_image(images[i])
        )
        np.save(
          f"strokes/{label_name}/{idx:06d}.npy", 
          preprocess_strokes(strokes[i])
        )

    global_idx += N

The image loading uses mmap_mode="r" - memory-mapped, so NumPy doesn't load the entire (100000, 784) array into RAM just to iterate over it row by row. The preprocessing happens at this stage, not at training time, so the generator later is just doing file reads.

This step takes a while to run. On the upside, it runs once.

Screenshot: the processed/ directory structure — showing the per-class subdirectories with numbered .npy files

What Preprocessing Actually Does

Images are straightforward. Reshape (784,) to (28, 28), divide by 255 to get [0, 1] floats, expand the channel dimension:

img = flat_img.reshape(28, 28).astype(np.float32) / 255.0
return np.expand_dims(img, axis=-1) # (28, 28, 1)

Strokes are more involved. The raw data uses relative coordinates — each (dx, dy) is an offset from the previous point, not an absolute position. This makes sense for how drawings are recorded but not for how a model should see them. The preprocessing converts to absolute, centers the drawing at the origin, then scales to a fixed [-100, 100] range:

# Relative -> absolute
strokes[:, 0] = np.cumsum(strokes[:, 0])
strokes[:, 1] = np.cumsum(strokes[:, 1])

# Center at origin
strokes[:, 0] -= strokes[:, 0].mean()
strokes[:, 1] -= strokes[:, 1].mean()

# Scale to [-100, 100]
max_coord = max(np.abs(strokes[:, 0]).max(), np.abs(strokes[:, 1]).max())
if max_coord > 0:
    strokes[:, 0] *= 100.0 / max_coord
    strokes[:, 1] *= 100.0 / max_coord

Stroke sequences are variable length. To get a fixed-size tensor for the LSTM, sequences are either truncated or zero-padded to 130 points. Why 130? Empirically, that covers the vast majority of drawings in the dataset without wasting too many zeros on the short ones.

The pen state column (the third feature) is left as-is — it’s already a binary indicator of whether the pen is lifted.

Screenshot: before/after visualization of a stroke — raw relative coordinates as a mess of lines, then the centered/normalized version looking like the actual drawing

The Loader

After preprocessing, the index step is fast. We walk the processed directory and collect all file paths:

for cls, label in LABEL_MAP.items():
    image_files = sorted(glob(f"{PROCESSED_DATA_DIR}/images/{cls}/*.npy"))
    stroke_files = sorted(glob(f"{PROCESSED_DATA_DIR}/strokes/{cls}/*.npy"))

N = min(len(image_files), len(stroke_files), SAMPLES_PER_CLASS)
    for i in range(N):
        images.append(image_files[i])
        strokes.append(stroke_files[i])
        labels.append(label)

At this point, images and strokes are just lists of strings. Nothing has been loaded into memory. The total dataset - 345 classes × 30,000 samples - indexes in a few seconds.

There’s also a threshold in the config: IN_MEMORY_THRESHOLD = 30_000. If SAMPLES_PER_CLASS is below that number, the loader will actually call np.load() during indexing and store the arrays directly. For quick experiments on a subset of data, this avoids the per-sample I/O overhead at training time. For large runs, it streams from disk instead.

USE_IN_MEMORY = USE_INDIVIDUAL and (SAMPLES_PER_CLASS <= IN_MEMORY_THRESHOLD)

Both paths feed into the same generator interface, which is a nice property — you can switch between them by changing one number.

The Generator and the tf.data Pipeline

The generator is a Python function that yields (image, stroke, one_hot_label) tuples:

def data_generator(images, strokes, labels):
    if USE_IN_MEMORY:
        for image, stroke, label in zip(images, strokes, labels):
            yield image, stroke, tf.one_hot(label, depth=NUM_CLASSES)
    elif USE_INDIVIDUAL:
        for img_path, str_path, label in zip(images, strokes, labels):
            yield (
                    np.load(img_path), 
                    np.load(str_path), 
                    tf.one_hot(label, depth=NUM_CLASSES)
                  )

This feeds into a tf.data.Dataset via from_generator, which requires explicit output signatures - TensorFlow needs to know shapes and dtypes upfront since it can't infer them from a Python generator:

output_signature = (
    tf.TensorSpec(shape=(28, 28, 1), dtype=tf.float32),
    tf.TensorSpec(shape=(130, 3), dtype=tf.float32),
    tf.TensorSpec(shape=(NUM_CLASSES,), dtype=tf.int32),
)

ds = tf.data.Dataset.from_generator(gen, output_signature=output_signature)

The full pipeline adds shuffling (shuffles a buffer of 10× the batch size rather than the entire dataset), repeating, batching at 512, and prefetching:

def build_dataset(images, strokes, labels, is_shuffle=False):
    ds = tf.data.Dataset.from_generator(gen, output_signature=output_signature)
    if is_shuffle:
        ds = ds.shuffle(BATCH_SIZE * 10)
    ds = ds.repeat()
    ds = ds.map(format_sample, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(BATCH_SIZE)
    return ds.prefetch(tf.data.AUTOTUNE)

The format_sample step reformats the yielded tuple into the dictionary format Keras expects for multi-input models:

def format_sample(img, stroke, label):
    return {"stroke_input": stroke, "image_input": img}, label

Shuffling indices, not files, is important here. The file layout on disk stays sequential — images for cat are in one directory, images for airplane in another. The shuffle happens in the data pipeline as it reads, which avoids random I/O seeks across the disk. Sequential reads are substantially faster than random ones, and the OS page cache will warm up the recently accessed files naturally.

Screenshot: htop showing RAM usage during training — relatively flat, not growing with training time

Splitting the Dataset

The split is index-based. We shuffle a global index array once with a fixed seed, then slice it:

indices = np.arange(total)
np.random.seed(42)
np.random.shuffle(indices)

train_end = int(0.8 * total)
val_end = train_end + int(0.1 * total)
train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

The 80/10/10 ratio applies across all classes since the indexing step already interleaved everything. There’s no risk of a class being entirely in the training set and absent from validation.

Validation and test datasets use .take() to consume a fixed number of batches - computed from the split sizes - since the generator repeats indefinitely:

val_ds = build_dataset(
            val_images, val_strokes, val_labels
          ).take(math.ceil(len(val_labels) / BATCH_SIZE))
test_ds = build_dataset(
            test_images, test_strokes, test_labels
          ).take(math.ceil(len(test_labels) / BATCH_SIZE))

What Went Wrong Along the Way

File count. The processed dataset ends up with roughly 345 × 30,000 × 2 = 20.7 million files. Some filesystems handle this poorly. If you're on a filesystem with inode limits or slow directory listing (common with some HPC storage systems), the sorted(glob(...)) calls at index time can take several minutes. Structured subdirectories (one per class) help, but it's still a lot of files.

Index alignment. The global index scheme — where file names reflect position across all classes, not within a class — exists entirely to prevent a specific bug. An earlier version used per-class indices, which caused a silent alignment failure: image cat/000001.npy and stroke cat/000001.npy were always aligned, but after shuffling, the code was pulling from globally-indexed lists and the class-local numbering didn't correspond. The {idx:06d} naming ensures that whatever index you retrieve from the lists, the image and stroke file names will match.

Training on a remote server with an unstable SSH connection. The training history in the notebook has a gap. BackupAndRestore meant the model weights survived; the history object didn't. TensorBoard logs were the fallback, and the actual metrics are there - the notebook's loss and accuracy plots just show what was available from the Python history object after reconnecting. If you're doing long training runs remotely, save the history separately and frequently, not just at the end.

Memory growth with TensorFlow’s GPU allocator. By default, TensorFlow pre-allocates the entire GPU memory. For a machine shared with other users, or one running other processes, this is a problem. The fix is:

for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

This makes TensorFlow allocate GPU memory incrementally as needed. It’s not set by default because it can slightly reduce performance in some scenarios, but for shared environments it’s basically always the right call.

What I’d Do Differently

The main thing I’d want to add is parallel file loading. Right now the generator is single-threaded — it loads one sample at a time, yields it, repeats. tf.data.AUTOTUNE on the prefetch helps by trying to keep the pipeline filled ahead of the model's consumption, but the actual I/O is sequential. Adding multiple generator workers (like PyTorch's num_workers) would reduce the time the GPU spends waiting for data.

LMDB would also be worth experimenting with. The advantage over millions of small files is that it’s a single file that supports fast key-value lookup, sequential reading, and doesn’t suffer from filesystem overhead per-entry. The disadvantage is that it complicates the setup and makes debugging harder. For this project the small-files approach was fast enough, but at larger scale it would start to matter.

A smarter caching strategy (keeping recently accessed samples in a bounded RAM buffer) would also help with the “warm up” problem. The first epoch is always slower than subsequent ones because the OS page cache starts cold. A pre-warmed in-memory buffer for the most frequently accessed samples would smooth that out.

The Part That Surprised Me

When I first sketched this out, my expectation was that disk-based loading would be noticeably slower than loading everything to RAM — enough to be a real bottleneck. It wasn’t, for a reason that only became clear after thinking about it: individual .npy file loads are fast. A (28, 28, 1) array at float32 is 3,136 bytes. A (130, 3) stroke array is 1,560 bytes. These are tiny files. The actual read time per sample is in the low microseconds, and the OS cache handles repeat accesses to recently-read files transparently.

What you trade away compared to pure in-memory loading is predictability. With everything in RAM, access time is constant. With disk loading, you’re occasionally hitting a file that isn’t cached, and that read takes longer. In practice, the prefetch buffer absorbs most of this variance. The GPU never actually sat idle waiting for data in my runs — the bottleneck was always computation, not I/O.

The other thing that surprised me was how much the single-file-per-class approach had been hiding. When everything for cat is one big (100000, 784) array, you have no choice but to load the whole thing before you can access any of it. That's a loading cost you pay every time. With individual files, you pay per sample - and you only pay for the samples you actually use.

The Notebook Setup (in case it’s useful)

One thing worth mentioning for anyone running this on a remote server: the port forwarding setup for Jupyter. If you’re SSH-ing into a machine and want to run notebooks rather than pulling .py files and running them in screen sessions, you forward the Jupyter port to localhost:

ssh -L 8888:localhost:8888 user@server_ip

# On the server:
jupyter notebook --no-browser --port=8888

If you’re going through two layers of SSH (e.g. a department gateway server that routes to a compute node), you just carry the forwarding through:

ssh -L 8888:localhost:8888 user@gateway
# On gateway:
ssh -L 8888:localhost:8888 user@compute_node

And for full control over Python and library versions, running the kernel inside a virtual environment is worth the setup time:

python3.12 -m venv .tens
source .tens/bin/activate
pip install jupyter ipykernel tensorflow numpy tqdm matplotlib
python -m ipykernel install --user --name=.tens

Then you can select .tens as the kernel in Jupyter and know exactly what Python version and library versions are running - which matters if you're planning to later quantize the model and deploy it somewhere like a Raspberry Pi, where the environment constraints are much stricter.

The pipeline ended up being more engineered than I originally wanted. But it runs, it doesn’t crash, and it’ll scale to more classes or more samples per class without changes. For a dataset this size on a memory-constrained machine, that’s the bar.

The code is all in the repository if you want to look at the actual implementation rather than the edited excerpts here. I’ll make this public once the paper is accepted.

This article is rewritten using AI chatbots.

April 14, 2026

Reverse Engineering SmartLock by Parivahan: What I Found Inside a Python Proctoring App

Yuvraj Raghuvanshi — Tue, 07 Apr 2026 12:14:33 +0000

I didn’t plan to reverse engineer a proctoring application. I just wanted to understand why a page kept refreshing in an infinite loop.

That one puzzling symptom ended up pulling me down a rabbit hole that took days to climb out of — involving PyInstaller internals, broken decompilers, browser automation quirks, and a race condition that convincingly pretended to be tamper detection. The journey was longer than I expected, and honestly more interesting. So I figured I might as well write it up.

The application in question is SmartLock by Parivahan , a proctoring system used for government driver’s learning license exams in India, and yes, I am getting a driving license at the age of 25. We all start somewhere.

It’s a Python desktop app that locks down your machine, watches your screen, monitors USB ports, controls your browser, and talks to a remote server — all at the same time. Understanding how it does all of this, and how the pieces fit together, is what this article is about.

Phase 1: Opening the Box

The first thing I did was look at the installation directory. This usually tells you a lot before you write a single line of code.

_internal/
browser/
config/
log/
Pictures/
Smartlock.exe

That _internal/ folder was the giveaway. It's a classic PyInstaller signature. The folder contains Python libraries and compiled bytecode - essentially, a self-contained Python runtime bundled into a single executable.

Screenshot: Installation directory structure showing _internal/ folder and Smartlock.exe

So the application was built in Python and packaged using PyInstaller. That meant extraction was possible using pyinstxtractor:

python pyinstxtractor.py Smartlock.exe

After extraction, the structure looked like this:

Smartlock.exe_extracted/
├── PYZ-00.pyz_extracted/
│ ├── asyncio/
│ ├── psutil/
│ ├── pydivert/
│ ├── selenium/
│ ├── websockets/
│ ├── win32com/
│ ├── yaml/
│ ├── controller.pyc
│ ├── registry_edit.pyc
│ └── ...
├── core.pyc
└── ...

Most of what you see here is noise — third-party libraries. The signal is in the handful of .pyc files: core.pyc, controller.pyc, registry_edit.pyc. These contain the actual application logic. Everything else is plumbing.

Phase 2: Decompilation and Why It’s Always Messier Than It Sounds

This is where things got annoying.

I tried the standard tools first:

uncompyle6 core.pyc
decompyle3 core.pyc

Both failed. Version mismatch — the bytecode was compiled with a Python version these tools didn’t fully support. I eventually got somewhere using pychaos, but I want to be honest about what “decompiled code” actually looks like in practice. It’s not clean. Comments are gone (they’re never stored in bytecode). Control flow gets reconstructed heuristically and is often wrong. You get artifacts like this:

while __CHAOS_PY_TEST_NOT_INIT_ERR__ :
    message = yield None

When the actual code probably looked something like:

async for message in websocket:
    msg = json.loads(message)

The decompiler is doing its best, but it’s guessing. Reverse engineering at this level is less about reading code and more about reconstructing intent from imperfect evidence. You develop a feel for what the code is trying to do, even when the syntax is broken.

The three key files broke down roughly like this:

core.pyc Main orchestrator - startup, thread management

controller.pyc Enforcement logic - monitoring, detection

registry_edit.pyc OS-level restrictions - registry modifications

Phase 3: Reconstructing the Architecture

Once I had a working (if imperfect) picture of the code, the overall architecture became clear:

Screenshot: detailed diagram showcasing the connections between SmartLock application, bundled Chrome application, and remote server

It’s a tightly coupled system. The desktop app and the browser aren’t independent — they’re in constant communication. And both of them are talking to the remote server. Remove any one of these connections and the whole thing breaks.

Phase 4: The Browser That Kept Redirecting

After reconstructing and running the application, I ran into something strange.

The exam webpage was redirecting continuously to a 403.jsp webpage and in a split second back to exam login webpage. Every few seconds — reload, reload, reload. My first instinct was that this was intentional tamper detection. After all, the whole point of a proctoring system is to detect when something isn’t right. Maybe it had detected something about my environment and was punishing me with an infinite loop.

That turned out to be wrong. But figuring out why it was wrong took a while.

The browser bundled with SmartLock isn’t a standard Chrome installation. It’s a portable Chromium build with a preconfigured user profile:

browser/
├── App/
│ └── Chrome-bin/
├── Data/
│ └── profile/
│ └── Default/

I inspected the stored cookies, session tokens, cached scripts, and extensions looking for some kind of tamper detection artifact. Nothing useful. The refreshing wasn’t coming from stored state.

Screenshot: browser/ directory structure

Phase 5: SmartSocket.js — Where It All Connected

The actual cause was in the exam webpage itself. Buried in the page source was this:

<script src="SmartLock/SmartSocket.js"></script>

This script establishes a WebSocket connection to the local application:

socket = new WebSocket('ws://localhost:8000/');

This is the critical link. The browser doesn’t just display the exam — it actively depends on the local application being alive and reachable. As soon as the connection is established, the browser authenticates:

reqOb.type = "Authentication";
reqOb.token = '1234';
reqOb.userid = appl_no;
socket.send(JSON.stringify(reqOb));

And if the connection fails — even momentarily — the page clears the session and reloads:

// On connection failure:
// → Clear session
// → Redirect or reload
// → UI resets to initial state

This is not tamper detection. It’s a strict runtime dependency. The browser requires the local WebSocket server to be up before it finishes loading. If it isn’t, you get an infinite reload loop.

Screenshot: SmartSocket.js connection code in browser devtools

Phase 6: The Race Condition

With that understanding, the root cause became obvious. The application starts the WebSocket server and the browser in parallel:

thread1 = start_websocket()
thread2 = launch_browser()

The problem with this is that “starting” the WebSocket server takes a moment. The browser, however, is fast — it loads the page, runs the script, and tries to connect to localhost:8000 before the server is actually ready. Connection fails. Page reloads. Tries again. Same thing. Infinite loop.

The sequence of events looked like this:

Browser loads page
 → SmartSocket.js executes immediately
 → Attempts WebSocket connection to localhost:8000
 → Server not ready yet
 → Connection refused
 → Page session cleared
 → Page reloads
 → Same thing happens again
 → ...forever

It perfectly mimicked tamper detection behavior, which is why I assumed that’s what it was. But it was just a timing issue.

The fix is simple: wait for the server to be ready before launching the browser.

def browser_start(self):
    try:
        import socket as sock
        import time

        for _ in range(50):
            try:
                s = sock.create_connection(("localhost", 8000), timeout=0.1)
                s.close()
                break
            except OSError:
                time.sleep(0.1)

        # Now launch browser

With this in place, the correct sequence is:

Start App
 → Start WebSocket server
 → Poll until server is accepting connections
 → Launch browser
 → WebSocket connects successfully
 → Authentication succeeds
 → Exam proceeds normally

Screenshot: Before — infinite reload loop vs successful startup

Phase 7: What the App Is Actually Doing Under the Hood

Once the startup problem was solved, I could look more carefully at all the enforcement mechanisms running in the background. There’s quite a lot.

OS-Level Lockdown

The registry editor modifies Windows to disable the usual escape routes:

DisableTaskMgr = 1
DisableLockWorkstation = 1
NoLogoff = 1

This disables Task Manager, the lock screen, and the ability to log off. The Ctrl+Alt+Del menu effectively becomes useless.

Process Monitoring

The monitoring engine maintains a list of software that shouldn’t be running during an exam. Screen recording and virtual camera tools are specifically targeted:

OBS Studio
ManyCam
XSplit

If any of these processes are detected, a violation is flagged.

USB Monitoring

The app takes a snapshot of connected USB devices at startup and watches for changes:

if current_usb != initial_usb:
    flag_violation()

Plugging in a USB drive during the exam is treated as a potential integrity violation.

Multi-Monitor and VM Detection

Multiple displays are blocked. The app also checks whether it’s running inside a virtual machine — which would make it easier to manipulate the environment without being detected.

Network Filtering via pydivert

This was the most interesting piece. The app uses pydivert - a Python wrapper around WinDivert - to implement packet-level network filtering. During an exam, only certain destinations are allowed. Everything else is dropped at the kernel level.

Phase 8: Two WebSockets, Not One

I initially assumed there was a single WebSocket connection: browser to local app. There are actually two:

Local WebSocket (localhost:8000) Browser ↔ Desktop App

Remote WebSocket Desktop App ↔ Remote Server

The local one handles session management for the browser. The remote one is for continuous telemetry — the app regularly sends status updates to the server:

{ "type": "USB", "status": true }
{ "type": "ProcessCheck", "detected": false }

The server isn’t passive. It’s continuously validating that the client is behaving correctly. If the telemetry stops or reports a violation, the server can terminate the session.

Configuration

All the connection details live in a YAML config file:

ExamIp: 164.100.69.5
ExamUrl: https://sarathi.parivahan.gov.in/sarathiservice/authenticationaction.do?authtype=Anugyna
SocketPort: 8000
SocketServerPort: 3000
SocketUrl: ws://sarathi.parivahan.gov.in
StatusApiUrl: https://sarathi.parivahan.gov.in/sarathiWS/rsServices/smartLockCheck/smartLockCheck
ViolationApiUrl: https://sarathicov.nic.in:8443/sarathiWS/rsServices/smartLockCheck/examViolation
primaryServerIPV4: 10.172.31.33
primaryServerIPV6: 2001:4408:7204:8:5d93:8239:8876:d238
secondaryServerIPV4: 10.172.31.30
secondaryServerIPV6: 2001:4408:7204:9::aac:2033

This means the behavior is somewhat server-controlled. The exam URL, the socket address, the session logic — it’s all configured externally, which makes the server the real authority over how the session runs.

Phase 9: The Firewall

One more thing worth mentioning: the app appears to interact with the Windows Firewall directly.

fw.pbox_fw_backup("pbox_bkp.wfw")

This exports the current firewall rules to a backup file before modifying them. The behavior is that the app replaces your firewall rules with its own restricted ruleset for the duration of the exam, then restores the backup afterward. It may also hash the backup to detect if someone has tampered with the rules mid-session.

Phase 10: Why You Can’t Just Rebuild It

One last I want to address directly: extracting the PyInstaller binary does not give you a working copy of the application. There’s a common misconception that extraction = reconstruction. It doesn’t.

The workflow for actually rebuilding would be:

Decompile .pyc files to .py
Manually correct the decompilation errors
Rebuild with PyInstaller

Steps 1 and 2 are where it falls apart in practice. The decompiled code has inaccuracies that aren’t always obvious. Some of the control flow is wrong in subtle ways that only become apparent at runtime. There are also timing dependencies baked into the threading model, and the server-side validation means you’d need a cooperating server to test anything properly.

The security here doesn’t come from any single mechanism being unbreakable. It comes from the combination:

The local app monitors the OS
The browser depends on the local app
The server monitors the local app
The network is filtered at the kernel level
The firewall is replaced during the session

Each layer on its own is probably defeatable. Together, they create a system where defeating one layer doesn’t help much because the others remain intact.

What I Took Away From This

A few things stuck with me after this investigation.

The infinite reload loop was genuinely convincing as tamper detection. I spent more time than I’d like to admit looking for a security mechanism that wasn’t there. The lesson is that emergent behavior from a race condition can look exactly like intentional defensive behavior. Don’t assume intent before you’ve traced the actual execution path.

The browser is doing real security work here, not just displaying a UI. SmartSocket.js is the link between the exam session and the local enforcement system. If that connection breaks, the exam can't proceed. That's a deliberate architectural choice, not an accident.

And PyInstaller extraction, while possible, is just the beginning. The hard part isn’t getting the bytecode out. It’s making sense of what the decompiler gives you and reconstructing what the developer originally meant to write.

What Could Come Next

If I were going to continue this investigation, the natural directions would be:

Mapping the full WebSocket protocol between browser and app, and between app and server
Tracing the telemetry payloads to understand exactly what data gets sent and when
Building a sequence diagram for the full session lifecycle, from startup through exam completion
Looking more carefully at the firewall manipulation and how (or whether) it detects tampering with the backed-up rules

None of that fits in one article. But the architecture is now clear enough that any of those threads could be pulled independently.

Screenshot: Final — application running normally with exam loaded, WebSocket connection established

This article is for educational purposes — understanding how production security systems are architected and why they’re difficult to tamper with. The focus throughout has been on the design and behavior of the system, not on defeating it.

This article is rewritten using AI chatbots.

April 07, 2026