James M

How I Reverse-Lookup M-Pesa's Hashed Phone Numbers Using a 7.2 GB Binary File and One S3 Request

At some point, Safaricom quietly changed the M-Pesa Daraja C2B callback payload. The MSISDN field — which used to contain a partially-masked phone number like 2547*****126 — started returning something like this:

```
fc418dcfe94c732a57f109e85e54d6c2e86f7c2afe6a2a5a6a5e5a9a9a9a9a99
```

A SHA256 hash. No announcement, no migration guide. Your integration just quietly broke.

This is how I solved it.


The problem

SHA256 is a one-way function. You cannot mathematically reverse it. The only way to recover the original phone number is to hash every possible candidate and check for a match.

For Kenyan Safaricom numbers the search space is 2547XXXXXXXX and 2541XXXXXXXX — exactly 200 million candidates. Hashing all of them on every callback request takes roughly 2 minutes. That's not a solution, that's a timeout.


Why the obvious approaches don't work

| Approach | Lookup time | Problem |
|---|---|---|
| Brute-force per request | ~2 min | Kills every callback before Safaricom's 5-second timeout |
| SQLite file | ~5–50 ms | ~15 GB with indexes, plus query overhead |
| PostgreSQL | ~5 ms | Now you need a database server for a Lambda function |
| Rainbow tables | Saves some space | Complex, slower than direct lookup, not worth it for a closed 200M space |

The insight that unlocked everything: the phone number space is closed and enumerable.

There are exactly 200 million valid numbers. That's small enough to pre-hash every single one, sort the results, and store them in a flat file. Every subsequent lookup becomes a binary search — ~28 comparisons, sub-millisecond, no server, no dependencies.
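A quick sanity check on that comparison count:

```python
import math

# 200 million sorted records → worst-case binary search comparisons
candidates = 200_000_000
comparisons = math.ceil(math.log2(candidates))
print(comparisons)  # → 28
```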


Building the database

Each record is 36 bytes: 32 bytes of SHA256 hash + 4 bytes of encoded phone index.

```python
import hashlib
import struct

def hash_number(n: int) -> bytes:
    # Indices 0–99,999,999 map to 2547XXXXXXXX; 100M–199,999,999 to 2541XXXXXXXX
    phone = f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"
    return hashlib.sha256(phone.encode()).digest() + struct.pack(">I", n)
```

200 million of these, sorted by hash, packed into a flat binary file = 7.2 GB.

The build is parallelised across all CPU cores — each worker hashes and sorts a chunk independently, then the chunks are merged in a single streaming pass. On a modern laptop it takes 60–120 seconds and requires about 8 GB of free disk space. You only ever build it once.

```shell
python msisdn_lookup.py build
# Building hashes.bin...
# Workers: 12
# Done in 73s. Records: 200000000. File: 7.2 GB
```
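The repo's build code isn't reproduced in this post; a scaled-down, single-process sketch of the chunk-sort-then-merge strategy (function names here are mine, not the library's) looks like this:

```python
import hashlib
import heapq
import struct

RECORD_SIZE = 36  # 32-byte SHA256 digest + 4-byte big-endian index

def hash_number(n: int) -> bytes:
    phone = f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"
    return hashlib.sha256(phone.encode()).digest() + struct.pack(">I", n)

def build_sorted_file(path: str, count: int, chunk_size: int = 1_000_000) -> None:
    # Hash and sort each chunk independently (the real build farms these
    # chunks out to worker processes), then merge in one streaming pass.
    chunks = []
    for start in range(0, count, chunk_size):
        stop = min(start + chunk_size, count)
        chunks.append(sorted(hash_number(n) for n in range(start, stop)))
    with open(path, "wb") as f:
        for record in heapq.merge(*chunks):
            f.write(record)

# Tiny demo: 1,000 records instead of 200 million
build_sorted_file("hashes_demo.bin", 1_000)
```

The merge step is what keeps memory bounded: each chunk is sorted on its own, and `heapq.merge` streams the globally sorted order without ever holding the whole file in RAM.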

Local lookup: binary search

With a sorted flat file, lookup is textbook binary search:

```python
import mmap
import struct

RECORD_SIZE = 36
RECORD_COUNT = 200_000_000

def index_to_phone(n: int) -> str:
    # Inverse of the encoding in hash_number
    return f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"

def lookup(target_hash: bytes, f) -> str | None:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    lo, hi = 0, RECORD_COUNT - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        record = mm[mid * RECORD_SIZE : (mid + 1) * RECORD_SIZE]
        h, idx = record[:32], struct.unpack(">I", record[32:])[0]
        if h == target_hash:
            return index_to_phone(idx)
        elif h < target_hash:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

~28 iterations. After the OS page cache warms up, repeated lookups are effectively instant.


The Lambda problem — and how uniform distribution solves it

Downloading 7.2 GB inside a Lambda function isn't viable. But you don't have to.

SHA256 output is uniformly distributed. That means if you sort 200 million SHA256 hashes, they're spread evenly across the full 32-byte hash space. The first 8 bytes of any target hash directly estimate its position in the sorted file:

```python
import struct

def estimate_position(target_hash: bytes, record_count: int) -> int:
    prefix = struct.unpack(">Q", target_hash[:8])[0]
    max_prefix = (1 << 64) - 1
    return int(record_count * prefix / max_prefix)
```

This gives you a position estimate accurate to within a few thousand records. Fetch a 2 MB block centred on that position with a single S3 Range GET, binary search within the block — and you're done. One HTTP request, no cold-start overhead, no database.
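You can check the accuracy claim empirically at small scale. This sketch (mine, not from the library) sorts 100,000 SHA256 hashes and measures the worst-case estimate error:

```python
import hashlib
import struct

def estimate_position(target_hash: bytes, record_count: int) -> int:
    prefix = struct.unpack(">Q", target_hash[:8])[0]
    return int(record_count * prefix / ((1 << 64) - 1))

# 100,000 hashes of arbitrary inputs stand in for the real phone-number hashes
hashes = sorted(hashlib.sha256(str(i).encode()).digest() for i in range(100_000))
worst = max(abs(estimate_position(h, len(hashes)) - i) for i, h in enumerate(hashes))
print(worst)  # worst-case error; typically a few hundred at 100k records
```

The per-lookup standard error grows as √N, so at the full 200 million records it lands on the order of a few thousand, consistent with the claim above.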

```python
import boto3

s3 = boto3.client("s3")

RECORD_SIZE = 36
WINDOW_RECORDS = 56_000  # ≈ 2 MB of 36-byte records

def lambda_lookup(target_hash: bytes) -> str | None:
    pos = estimate_position(target_hash, RECORD_COUNT)

    # Centre a ~2 MB window on the estimated position, aligned to record size
    start = max(0, (pos - WINDOW_RECORDS // 2) * RECORD_SIZE)
    end = min(FILE_SIZE, start + WINDOW_RECORDS * RECORD_SIZE)

    resp = s3.get_object(
        Bucket=LOOKUP_BUCKET,
        Key=LOOKUP_KEY,
        Range=f"bytes={start}-{end - 1}"  # HTTP byte ranges are inclusive
    )
    block = resp["Body"].read()
    return binary_search_block(block, target_hash, start // RECORD_SIZE)
```

The window covers 56,000 records. The position estimate's standard error is at most √(200M)/2 ≈ 7,100 records, so a ±28,000-record window spans roughly four standard deviations: the hash is effectively always found in one request, and the rare miss can fall back to a second, wider fetch.

In practice: ~20 ms including S3 network latency. The Lambda never needs more than 256 MB of memory.


Putting it together in production

The full system is a serverless M-Pesa webhook receiver:

```
Safaricom callbacks
    → CloudFront (WAF IP allowlist)
    → API Gateway
    → Lambda (parse payload, S3 lookup, DynamoDB write)
    → DynamoDB Stream
    → Zoho Books sync + Google Sheets sync
```
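A minimal, dependency-injected sketch of the Lambda step (naming is mine throughout; `lookup_fn` and `store_fn` stand in for the S3 lookup and DynamoDB write, and the field names are assumptions based on the Daraja C2B confirmation payload):

```python
import json

def make_confirm_handler(lookup_fn, store_fn):
    """Build a Lambda-style handler with injected dependencies, so the
    sketch stays independent of boto3."""
    def handler(event, context=None):
        payload = json.loads(event["body"])
        hashed = bytes.fromhex(payload["MSISDN"])   # hex SHA256 from Safaricom
        phone = lookup_fn(hashed)                   # S3 range-GET lookup in production
        store_fn({
            "trans_id": payload["TransID"],
            "amount": payload["TransAmount"],
            "msisdn": phone or payload["MSISDN"],   # fall back to the raw hash
        })
        # Acknowledgement shape Daraja expects for C2B confirmations
        return {"statusCode": 200,
                "body": json.dumps({"ResultCode": 0, "ResultDesc": "Accepted"})}
    return handler
```

Injecting the lookup and storage functions also makes the handler trivially unit-testable without AWS credentials.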

Running costs at ~500 transactions/day: ~$0.05/month.

The MSISDN lookup adds roughly 20 ms to the confirm handler. The 5-second Safaricom timeout is never a concern.


Try it yourself

The lookup library is open source and has no external dependencies for core use — pure Python stdlib:

github.com/endafk/msisdn-lookup

```shell
git clone https://github.com/endafk/msisdn-lookup.git
cd msisdn-lookup
python msisdn_lookup.py build   # one-time, ~90s
python msisdn_lookup.py lookup fc418dcfe94c732a...
# → 254712345678
```

If you're building on the Safaricom Daraja API and hit the hashed MSISDN problem, this is probably the fastest solution that exists right now. And if you're not on M-Pesa — the technique applies anywhere you have a closed, enumerable pre-image space and need sub-millisecond lookups without a database.


Built and deployed in production. The approach came out of necessity — Safaricom's documentation on the hash change is, to put it generously, sparse.
