James M

How I Reverse-Lookup M-Pesa's Hashed Phone Numbers Using a 7.2 GB Binary File and One S3 Request

At some point, Safaricom quietly changed the M-Pesa Daraja C2B callback payload. The MSISDN field — which used to contain a partially-masked phone number like 2547*****126 — started returning something like this:

```
fc418dcfe94c732a57f109e85e54d6c2e86f7c2afe6a2a5a6a5e5a9a9a9a9a99
```

A SHA256 hash. No announcement, no migration guide. Your integration just quietly broke.

This is how I solved it.


The problem

SHA256 is a one-way function. You cannot mathematically reverse it. The only way to recover the original phone number is to hash every possible candidate and check for a match.

For Kenyan Safaricom numbers the search space is 2547XXXXXXXX and 2541XXXXXXXX — exactly 200 million candidates. Hashing all of them on every callback request takes roughly 2 minutes. That's not a solution, that's a timeout.


Why the obvious approaches don't work

| Approach | Lookup time | Problem |
|---|---|---|
| Brute-force per request | ~2 min | Kills every callback before Safaricom's 5-second timeout |
| SQLite file | ~5–50 ms | ~15 GB with indexes, plus query overhead |
| PostgreSQL | ~5 ms | Now you need a database server for a Lambda function |
| Rainbow tables | Saves some space | Complex, slower than direct lookup, not worth it for a closed 200M space |

The insight that unlocked everything: the phone number space is closed and enumerable.

There are exactly 200 million valid numbers. That's small enough to pre-hash every single one, sort the results, and store them in a flat file. Every subsequent lookup becomes a binary search — ~28 comparisons, sub-millisecond, no server, no dependencies.
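A quick sanity check on that comparison count:

```python
import math

# 200 million sorted records → worst-case binary search comparisons
candidates = 200_000_000
comparisons = math.ceil(math.log2(candidates))
print(comparisons)  # → 28
```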


Building the database

Each record is 36 bytes: 32 bytes of SHA256 hash + 4 bytes of encoded phone index.

```python
import hashlib
import struct

def hash_number(n: int) -> bytes:
    # Indices 0–99,999,999 map to 2547XXXXXXXX; 100M–199,999,999 to 2541XXXXXXXX
    phone = f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"
    return hashlib.sha256(phone.encode()).digest() + struct.pack(">I", n)
```

200 million of these, sorted by hash, packed into a flat binary file = 7.2 GB.

The build is parallelised across all CPU cores — each worker hashes and sorts a chunk independently, then the chunks are merged in a single streaming pass. On a modern laptop it takes 60–120 seconds and requires about 8 GB of free disk space. You only ever build it once.

```shell
python msisdn_lookup.py build
# Building hashes.bin...
# Workers: 12
# Done in 73s. Records: 200000000. File: 7.2 GB
```
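The repo's build code isn't reproduced in this post; a scaled-down, single-process sketch of the chunk-sort-then-merge strategy (function names here are mine, not the library's) looks like this:

```python
import hashlib
import heapq
import struct

RECORD_SIZE = 36  # 32-byte SHA256 digest + 4-byte big-endian index

def hash_number(n: int) -> bytes:
    phone = f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"
    return hashlib.sha256(phone.encode()).digest() + struct.pack(">I", n)

def build_sorted_file(path: str, count: int, chunk_size: int = 1_000_000) -> None:
    # Hash and sort each chunk independently (the real build farms these
    # chunks out to worker processes), then merge in one streaming pass.
    chunks = []
    for start in range(0, count, chunk_size):
        stop = min(start + chunk_size, count)
        chunks.append(sorted(hash_number(n) for n in range(start, stop)))
    with open(path, "wb") as f:
        for record in heapq.merge(*chunks):
            f.write(record)

# Tiny demo: 1,000 records instead of 200 million
build_sorted_file("hashes_demo.bin", 1_000)
```

The merge step is what keeps memory bounded: each chunk is sorted on its own, and `heapq.merge` streams the globally sorted order without ever holding the whole file in RAM.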

Local lookup: binary search

With a sorted flat file, lookup is textbook binary search:

```python
import mmap
import struct

RECORD_SIZE = 36
RECORD_COUNT = 200_000_000

def index_to_phone(n: int) -> str:
    # Inverse of the encoding in hash_number
    return f"2547{n:08d}" if n < 100_000_000 else f"2541{n - 100_000_000:08d}"

def lookup(target_hash: bytes, f) -> str | None:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    lo, hi = 0, RECORD_COUNT - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        record = mm[mid * RECORD_SIZE : (mid + 1) * RECORD_SIZE]
        h, idx = record[:32], struct.unpack(">I", record[32:])[0]
        if h == target_hash:
            return index_to_phone(idx)
        elif h < target_hash:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

~28 iterations. After the OS page cache warms up, repeated lookups are effectively instant.


The Lambda problem — and how uniform distribution solves it

Downloading 7.2 GB inside a Lambda function isn't viable. But you don't have to.

SHA256 output is uniformly distributed. That means if you sort 200 million SHA256 hashes, they're spread evenly across the full 32-byte hash space. The first 8 bytes of any target hash directly estimate its position in the sorted file:

```python
import struct

def estimate_position(target_hash: bytes, record_count: int) -> int:
    prefix = struct.unpack(">Q", target_hash[:8])[0]
    max_prefix = (1 << 64) - 1
    return int(record_count * prefix / max_prefix)
```

This gives you a position estimate accurate to within a few thousand records. Fetch a 2 MB block centred on that position with a single S3 Range GET, binary search within the block — and you're done. One HTTP request, no cold-start overhead, no database.
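You can check the accuracy claim empirically at small scale. This sketch (mine, not from the library) sorts 100,000 SHA256 hashes and measures the worst-case estimate error:

```python
import hashlib
import struct

def estimate_position(target_hash: bytes, record_count: int) -> int:
    prefix = struct.unpack(">Q", target_hash[:8])[0]
    return int(record_count * prefix / ((1 << 64) - 1))

# 100,000 hashes of arbitrary inputs stand in for the real phone-number hashes
hashes = sorted(hashlib.sha256(str(i).encode()).digest() for i in range(100_000))
worst = max(abs(estimate_position(h, len(hashes)) - i) for i, h in enumerate(hashes))
print(worst)  # worst-case error; typically a few hundred at 100k records
```

The per-lookup standard error grows as √N, so at the full 200 million records it lands on the order of a few thousand, consistent with the claim above.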

```python
import boto3

s3 = boto3.client("s3")

RECORD_SIZE = 36
WINDOW_RECORDS = 56_000  # ≈ 2 MB of 36-byte records

def lambda_lookup(target_hash: bytes) -> str | None:
    pos = estimate_position(target_hash, RECORD_COUNT)

    # Centre a ~2 MB window on the estimated position, aligned to record size
    start = max(0, (pos - WINDOW_RECORDS // 2) * RECORD_SIZE)
    end = min(FILE_SIZE, start + WINDOW_RECORDS * RECORD_SIZE)

    resp = s3.get_object(
        Bucket=LOOKUP_BUCKET,
        Key=LOOKUP_KEY,
        Range=f"bytes={start}-{end - 1}"  # HTTP byte ranges are inclusive
    )
    block = resp["Body"].read()
    return binary_search_block(block, target_hash, start // RECORD_SIZE)
```

The window covers 56,000 records. The position estimate's standard error is at most √(200M)/2 ≈ 7,100 records, so a ±28,000-record window spans roughly four standard deviations: the hash is effectively always found in one request, and the rare miss can fall back to a second, wider fetch.

In practice: ~20 ms including S3 network latency. The Lambda never needs more than 256 MB of memory.


Putting it together in production

The full system is a serverless M-Pesa webhook receiver:

```
Safaricom callbacks
    → CloudFront (WAF IP allowlist)
    → API Gateway
    → Lambda (parse payload, S3 lookup, DynamoDB write)
    → DynamoDB Stream
    → Zoho Books sync + Google Sheets sync
```
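A minimal, dependency-injected sketch of the Lambda step (naming is mine throughout; `lookup_fn` and `store_fn` stand in for the S3 lookup and DynamoDB write, and the field names are assumptions based on the Daraja C2B confirmation payload):

```python
import json

def make_confirm_handler(lookup_fn, store_fn):
    """Build a Lambda-style handler with injected dependencies, so the
    sketch stays independent of boto3."""
    def handler(event, context=None):
        payload = json.loads(event["body"])
        hashed = bytes.fromhex(payload["MSISDN"])   # hex SHA256 from Safaricom
        phone = lookup_fn(hashed)                   # S3 range-GET lookup in production
        store_fn({
            "trans_id": payload["TransID"],
            "amount": payload["TransAmount"],
            "msisdn": phone or payload["MSISDN"],   # fall back to the raw hash
        })
        # Acknowledgement shape Daraja expects for C2B confirmations
        return {"statusCode": 200,
                "body": json.dumps({"ResultCode": 0, "ResultDesc": "Accepted"})}
    return handler
```

Injecting the lookup and storage functions also makes the handler trivially unit-testable without AWS credentials.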

Running costs at ~500 transactions/day: ~$0.05/month.

The MSISDN lookup adds roughly 20 ms to the confirm handler. The 5-second Safaricom timeout is never a concern.


Try it yourself

The lookup library is open source and has no external dependencies for core use — pure Python stdlib:

github.com/endafk/msisdn-lookup

```shell
git clone https://github.com/endafk/msisdn-lookup.git
cd msisdn-lookup
python msisdn_lookup.py build   # one-time, ~90s
python msisdn_lookup.py lookup fc418dcfe94c732a...
# → 254712345678
```

If you're building on the Safaricom Daraja API and hit the hashed MSISDN problem, this is probably the fastest solution that exists right now. And if you're not on M-Pesa — the technique applies anywhere you have a closed, enumerable pre-image space and need sub-millisecond lookups without a database.


Built and deployed in production. The approach came out of necessity — Safaricom's documentation on the hash change is, to put it generously, sparse.
