Ioan G. Istrate

Posted on • Originally published at blog.tripvento.com

Your API Is Leaking Source Fingerprints. Here's How to Stop It.

Why transformed data still reveals where it came from

TL;DR: Your API responses contain fingerprints from your data sources. Your six decimal coordinates, ZIP+4 formats, and exact price values give away where you got your data from, even after you transform it. Solution: round your coordinates to 4 decimal places, standardize your address formats, and bin your exact numbers. And none of this hurts your product. It just stops you from bragging about your supply chain in each and every response.

I was reviewing my API responses one day and noticed that I was leaking the source of my data. Not through keys or logs, but through the six decimal coordinates, the ZIP+4 formatting, and the precise price values.

I had aggregated, transformed, scored, and productized the data. The problem was that the fingerprints of its source were still in the responses.

Here is how I discovered the issue and what I did about it.

Data Has Fingerprints

This is what caught my eye:

{
  "latitude": 41.899223,
  "longitude": -87.622225,
  "address": "198 East Delaware Place, Chicago, IL 60611, USA",
  "price_per_night": 129
}

The coordinates had six decimal places, which is a precision of about 10 centimeters. My API didn't need that, and neither did my users. So why was it there?

Well, that's just the way it came. I was passing through the data without thinking about what it said.

In the world of plagiarism detection, we have something called "tells." These are the artifacts that give away the source. A student plagiarizes code and renames the variables, but leaves a particular comment or formatting quirk intact. The content may vary, but the fingerprint remains the same.

My API had the same problem. The data may have been mine, but the fingerprints were not.

The Coordinate Problem

Different data providers store coordinates with different precision:

Provider | Precision | Accuracy
High precision | 6 decimals | ~10 cm
Standard | 5 decimals | ~1 m
Rounded | 4 decimals | ~10 m
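The accuracy figures above follow from the length of one degree of latitude, which is roughly 111 km. Here is a minimal sketch of that arithmetic. `METERS_PER_DEGREE` is an approximate constant and `resolution_m` is a hypothetical helper name, neither of which comes from the original pipeline.

```python
# Approximate length of one degree of latitude in meters.
# This is an assumption for illustration; the true value varies slightly.
METERS_PER_DEGREE = 111_320

def resolution_m(decimals):
    """Approximate ground distance represented by the last decimal place."""
    return METERS_PER_DEGREE * 10 ** -decimals

for decimals in (6, 5, 4):
    print(f"{decimals} decimals: ~{resolution_m(decimals):.2f} m")
# 6 decimals: ~0.11 m
# 5 decimals: ~1.11 m
# 4 decimals: ~11.13 m
```

For longitude the distance per degree shrinks with the cosine of the latitude, so these are upper bounds away from the equator.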

If your API returns 6-decimal coordinates, you are embedding someone else's fingerprint in your response. A competitor can compare your coordinate values against the providers' databases and pinpoint your sources in a matter of minutes.

The solution is to round the coordinates to 4 decimal places. That is about 10 meters. It is precise enough to place a hotel on a map but not precise enough to be traced.

def obfuscate_location(latitude, longitude, precision=4):
    if latitude is None or longitude is None:
        return None, None
    return round(float(latitude), precision), round(float(longitude), precision)

Before: 41.899223, -87.622225

After: 41.8992, -87.6222

Still accurate. No longer a fingerprint.
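To check the "still accurate" claim, one option is to measure how far rounding actually moves a point. The sketch below uses the standard haversine formula. `haversine_m` and `EARTH_RADIUS_M` are illustrative names, and the mean Earth radius is an approximation.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

original = (41.899223, -87.622225)
rounded = (round(original[0], 4), round(original[1], 4))

# Distance between the original point and its rounded version
shift = haversine_m(*original, *rounded)
print(f"Rounded to {rounded}, moved by about {shift:.1f} m")
```

For this hotel the shift is a few meters. In the worst case, rounding to four decimals moves each axis by at most 0.00005 degrees, which keeps the total displacement under roughly 8 meters.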

The Address Format Problem

This one is subtle but significant. Different providers format addresses differently:

Source | Format
Provider A | 198 East Delaware Place, Chicago, IL 60611, USA
Provider B | 198 E Delaware Pl, Chicago, 60611
Provider C | 198 E. Delaware Pl., Chicago, IL 60611

If you pass addresses through unchanged, the formatting itself becomes a fingerprint. "East" vs "E" vs "E." tells anyone looking exactly where you got the data.
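To see how little it takes to read this tell, here is a rough sketch of a classifier that looks only at the directional. The function name and the style labels are hypothetical, and the provider names are the placeholders from the table above.

```python
import re

def directional_style(address):
    """Classify how an address writes its directional (a simple 'tell')."""
    if re.search(r'\b(?:North|South|East|West)\b', address, flags=re.IGNORECASE):
        return "spelled_out"      # e.g. "East"
    if re.search(r'\b[NSEW]\.', address):
        return "dotted_abbrev"    # e.g. "E."
    if re.search(r'\b[NSEW]\b', address):
        return "plain_abbrev"     # e.g. "E"
    return "none"

samples = {
    "Provider A": "198 East Delaware Place, Chicago, IL 60611, USA",
    "Provider B": "198 E Delaware Pl, Chicago, 60611",
    "Provider C": "198 E. Delaware Pl., Chicago, IL 60611",
}

for provider, address in samples.items():
    print(provider, "->", directional_style(address))
# Provider A -> spelled_out
# Provider B -> plain_abbrev
# Provider C -> dotted_abbrev
```

A single regex per style is enough to separate the three sample formats, which is why unchanged formatting is such a cheap signal for someone comparing datasets.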

The solution is to normalize the data into a canonical form.

# illustrative example — real world address normalization
# should rely on a proper parsing library or ruleset

import re

def normalize_address(address):
    if not address:
        return address

    # standardize directionals
    address = re.sub(r'\bEast\b', 'E', address, flags=re.IGNORECASE)
    address = re.sub(r'\bWest\b', 'W', address, flags=re.IGNORECASE)
    address = re.sub(r'\bNorth\b', 'N', address, flags=re.IGNORECASE)
    address = re.sub(r'\bSouth\b', 'S', address, flags=re.IGNORECASE)

    # remove periods from directional and street-type abbreviations
    address = re.sub(r'\b([NSEW])\.(?=\s)', r'\1', address)
    address = re.sub(r'\b(St|Ave|Pl|Blvd)\.(?=\s*(?:,|$))', r'\1', address)

    # standardize street types
    address = re.sub(r'\bStreet\b', 'St', address, flags=re.IGNORECASE)
    address = re.sub(r'\bAvenue\b', 'Ave', address, flags=re.IGNORECASE)
    address = re.sub(r'\bPlace\b', 'Pl', address, flags=re.IGNORECASE)
    address = re.sub(r'\bBoulevard\b', 'Blvd', address, flags=re.IGNORECASE)

    # remove ZIP+4 extension
    address = re.sub(r'(\d{5})-\d{4}', r'\1', address)

    # remove country
    address = re.sub(r',?\s*USA\s*$', '', address, flags=re.IGNORECASE)
    address = re.sub(r',?\s*United States\s*$', '', address, flags=re.IGNORECASE)

    return address.strip()

Output: 198 E Delaware Pl, Chicago, IL 60611

Now it could have come from anywhere. That is the point.

The Precision Problem

Exact numbers are fingerprints. If your source quotes prices to the nearest dollar and you return that same dollar value, you're leaving a fingerprint. The same goes for review counts, distances, and any other field you didn't generate yourself.

The solution is to bucket everything.

# Bucket widths per price tier. Only the mid-tier width (25) is implied by
# the example below; the other three values are illustrative assumptions.
BUCKET_SIZE_LOW = 10
BUCKET_SIZE_MID = 25
BUCKET_SIZE_HIGH = 50
BUCKET_SIZE_PREMIUM = 100

def obfuscate_price_bucket(price):
    # treat missing or zero prices as unknown
    if not price:
        return None
    price = float(price)

    # pick the bucket width for the price tier
    if price < 100:
        size = BUCKET_SIZE_LOW
    elif price < 200:
        size = BUCKET_SIZE_MID
    elif price < 500:
        size = BUCKET_SIZE_HIGH
    else:
        size = BUCKET_SIZE_PREMIUM

    # floor to the start of the bucket and report the range
    bucket = (int(price) // size) * size
    return f"${bucket}-{bucket + size}"

Before: 129

After: $125-150

Your users get useful information. You do not expose exact values. Apply the same logic to review counts, distances, and hotel counts: anything that could be cross-referenced against a known dataset.
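The same bucketing idea carries over to counts and distances. The sketch below is one possible shape for it. The function names, the edge values, and the 0.5 km step are all assumptions chosen for illustration, not values from the production API.

```python
def bucket_count(count, edges=(10, 50, 100, 500, 1000)):
    """Map an exact count to a coarse label such as '100-499' or '1000+'.

    The edge values are illustrative assumptions.
    """
    if count is None:
        return None
    lower = 0
    for edge in edges:
        if count < edge:
            return f"{lower}-{edge - 1}"
        lower = edge
    return f"{lower}+"

def bucket_distance_km(distance_km, step=0.5):
    """Floor a distance to a multiple of `step` km and report the range."""
    if distance_km is None:
        return None
    start = (distance_km // step) * step
    return f"{start:g}-{start + step:g} km"

print(bucket_count(237))        # 100-499
print(bucket_count(1200))       # 1000+
print(bucket_distance_km(1.3))  # 1-1.5 km
```

Checking `count is None` rather than truthiness keeps a legitimate zero (for example, a hotel with no reviews yet) from being reported as missing.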

Provenance leaks are seldom due to a single field. They usually come from a collection of weak signals. Coordinate precision alone might not be enough. Coordinate precision combined with address abbreviation style, price granularity, null behavior, and field ordering, however, narrows the set of possible sources rapidly. Each field contributes a weak signal, and many weak signals add up to a strong one. That is why the defense has to be comprehensive. Normalizing the coordinates while leaving the addresses raw still gives a competitor room to get creative.
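To make the "many weak signals" point concrete, here is a small sketch that extracts a few format-level signals from a record and counts how many agree with each candidate source. The provider profiles are invented for illustration, and all function names are hypothetical.

```python
import re

def decimal_places(value):
    """Count digits after the decimal point in a float's shortest repr."""
    text = repr(float(value))
    if "." not in text or "e" in text:
        return 0
    return len(text.split(".")[1])

def extract_signals(record):
    """Pull a few format-level signals out of one API record."""
    address = record.get("address") or ""
    return {
        "coord_decimals": max(decimal_places(record["latitude"]),
                              decimal_places(record["longitude"])),
        "spelled_directional": bool(
            re.search(r'\b(North|South|East|West)\b', address)),
        "country_suffix": address.endswith(", USA"),
        "integer_price": float(record["price_per_night"]).is_integer(),
    }

# Hypothetical provider profiles, invented for illustration only.
PROFILES = {
    "Provider A": {"coord_decimals": 6, "spelled_directional": True,
                   "country_suffix": True, "integer_price": True},
    "Provider B": {"coord_decimals": 5, "spelled_directional": False,
                   "country_suffix": False, "integer_price": False},
}

def rank_sources(record):
    """Rank profiles by how many signals agree with the record."""
    signals = extract_signals(record)
    scores = {
        name: sum(signals[key] == expected for key, expected in profile.items())
        for name, profile in PROFILES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

record = {
    "latitude": 41.899223,
    "longitude": -87.622225,
    "address": "198 East Delaware Place, Chicago, IL 60611, USA",
    "price_per_night": 129,
}
print(rank_sources(record))
# [('Provider A', 4), ('Provider B', 0)]
```

No single signal here is conclusive, but four agreeing signals against zero is a clear ranking. In practice a comparison like this would look at many records, since one value can happen to end in a trailing zero.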

How I Think About It Now

Every field in an API response is either something you generated or something you inherited. The stuff you generated or transformed is yours. The stuff you inherited usually carries fingerprints from wherever it originated.

The plagiarism detection parallel is exact. When I grade student submissions at Georgia Tech, I am not looking for identical code. I am looking for tells such as unusual variable names, specific comment styles, and formatting quirks that match a known source. The student thinks they disguised the work, but the fingerprint says otherwise.

Your API is doing the same thing in reverse. You think you transformed the data. The six decimal coordinates, the ZIP+4 extension, and the exact dollar amounts say otherwise.

The fix is straightforward: normalize addresses, round coordinates, and bucket prices, so that you reduce precision to what your users actually need and nothing more. None of this degrades the product. It just stops you from advertising your supply chain in every response.

The goal is not to destroy utility. It is to remove unnecessary precision that preserves supplier-specific signatures without helping your users. Rounding coordinates to four decimals still places a hotel on a map. Bucketing prices still lets a traveler filter by budget. Normalizing addresses still gets someone to the front door. Good defense is selective degradation, not blind corruption. If the reduced precision would not change a single user decision, then the original precision was not serving your users. It was serving anyone trying to reverse-engineer your supply chain.

Before you send it out, go through the following checklist:

  • Lower the precision of coordinate values to appropriate levels for your product.

  • Reformat addresses into a format that makes sense for your product.

  • Group together highly specific numeric values.

  • Standardize the handling of null and default values for all fields.

  • Look for supplier-specific weirdness that repeats across many fields.

  • Test whether your records can still be matched back to likely sources.

If your output still looks like the source, then the source is still in the output.
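The last checklist item can be turned into a simple regression test. The sketch below measures what fraction of output records exactly match a suspected source on chosen key fields. `match_rate` is a hypothetical helper, and the sample records are made up.

```python
def match_rate(api_records, source_records, keys=("latitude", "longitude")):
    """Fraction of API records whose key fields exactly match a source record."""
    if not api_records:
        return 0.0
    source_index = {tuple(rec[k] for k in keys) for rec in source_records}
    hits = sum(tuple(rec[k] for k in keys) in source_index
               for rec in api_records)
    return hits / len(api_records)

# Made-up sample data standing in for a suspected upstream dataset.
source = [
    {"latitude": 41.899223, "longitude": -87.622225},
    {"latitude": 41.888541, "longitude": -87.635127},
]

# Pass-through output versus output rounded to four decimals
raw_output = [dict(rec) for rec in source]
rounded_output = [
    {"latitude": round(rec["latitude"], 4),
     "longitude": round(rec["longitude"], 4)}
    for rec in source
]

print(match_rate(raw_output, source))      # 1.0
print(match_rate(rounded_output, source))  # 0.0
```

This only models an exact-join comparison. A stricter test would also try fuzzy matching, for example rounding the source the same way or matching on nearest neighbors, since a determined competitor is unlikely to stop at exact equality.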

What Comes Next

Removing inbound fingerprints is a defensive measure that protects your sources, but it also has an offensive application.

With paying customers using unique API keys, you can reverse the approach by adding deterministic watermarks that trace each response to its recipient. If your data appears on a competitor's platform later, the watermark identifies the source of the leak.
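As a purely hypothetical illustration of the idea, and not the implementation promised for the next post, a deterministic watermark could be derived from a keyed hash of the API key and record ID. Every name below, the secret, and the choice of nudging the fifth decimal place are assumptions.

```python
import hashlib
import hmac

SECRET = b"replace-with-a-real-secret"  # hypothetical server-side secret

def watermark_offset(api_key, record_id, axis):
    """Derive a deterministic offset of -2..2 units of 1e-5 degrees."""
    message = f"{api_key}:{record_id}:{axis}".encode()
    digest = hmac.new(SECRET, message, hashlib.sha256).digest()
    return (digest[0] % 5 - 2) * 1e-5

def watermark_coords(api_key, record_id, lat, lon):
    """Return coordinates nudged by a per-recipient, per-record offset."""
    return (
        round(lat + watermark_offset(api_key, record_id, "lat"), 5),
        round(lon + watermark_offset(api_key, record_id, "lon"), 5),
    )

def identify_recipient(leaked, originals, api_keys):
    """Return the key whose watermarks match the most leaked records.

    `leaked` and `originals` both map record_id -> (lat, lon).
    """
    def matches(key):
        return sum(
            watermark_coords(key, rid, *originals[rid]) == coords
            for rid, coords in leaked.items()
        )
    return max(api_keys, key=matches)
```

A single record carries very little information in this scheme, because only 25 offset combinations exist, so attribution relies on many leaked records agreeing with one key. The offsets stay within about two meters, well inside the precision already given up by rounding.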

I will cover the full watermarking implementation in the next post. I am also building a tool that automates fingerprint detection and obfuscation across API responses. More on that soon.


I'm Ioan Istrate, founder of Tripvento - a hotel ranking API that scores properties against 14 traveler personas using geospatial intelligence and semantic AI. Previously worked on ranking systems at U.S. News & World Report. If you want to talk about data provenance, supply chain obfuscation, or API fingerprinting, let's connect on LinkedIn.

This is part 7 of the Building Tripvento series. Part 1 covered deleting 55M rows with PostGIS. Part 2 covered the multi LLM self healing data pipeline. Part 3 covered the Django performance audit. Part 4 covered zero public ports and API security. Part 5 covered the pSEO content factory. Part 6 covered prompt injection, steganography tools, and the LLM honeypot.
