Why transformed data still reveals where it came from
TL;DR: Your API responses carry fingerprints from your data sources. Six-decimal coordinates, ZIP+4 formats, and exact price values give away where your data came from, even after you transform it. The fix: round coordinates to four decimal places, standardize address formats, and bin exact numbers. None of this hurts your product. It just stops you from bragging about your supply chain in each and every response.
I was reviewing my API responses one day and noticed that I was leaking the source of my data. Not through keys or logs, but through the six decimal coordinates, the ZIP+4 formatting, and the precise price values.
I had aggregated the data, transformed the data, scored the data, and productized the data. The problem: the source's fingerprints were still in the responses.
Here is how I discovered the issue and what I did about it.
Data Has Fingerprints
This is what caught my eye:
```json
{
  "latitude": 41.899223,
  "longitude": -87.622225,
  "address": "198 East Delaware Place, Chicago, IL 60611, USA",
  "price_per_night": 129
}
```
Six decimal places on the coordinates, which is a precision of about 10 centimeters. My API didn't need that. My users didn't need that. So why was it there?
Well, that's just the way it came. I was passing the data through without thinking about what it said.
In plagiarism detection, we have something called "tells": artifacts that give away the source. A student plagiarizes code and renames the variables, but keeps a particular comment or formatting style. The content varies, but the fingerprint stays the same.
My API had the same problem. The data may have been mine, but the fingerprints were not.
The Coordinate Problem
Different data providers store coordinates with different precision:
| Provider | Precision | Accuracy |
|---|---|---|
| High precision | 6 decimals | ~10cm |
| Standard | 5 decimals | ~1m |
| Rounded | 4 decimals | ~10m |
If your API returns six-decimal coordinates, you are embedding someone else's fingerprint in your responses. A competitor can compare your coordinate values against provider databases and pinpoint your sources in a matter of minutes.
The solution is to round coordinates to four decimal places, roughly 10 meters of precision. That is precise enough to place a hotel on a map, but not precise enough to be traced back to a source database.
```python
def obfuscate_location(latitude, longitude, precision=4):
    if latitude is None or longitude is None:
        return None, None
    return round(float(latitude), precision), round(float(longitude), precision)
```
Before: 41.899223, -87.622225
After: 41.8992, -87.6222
Still accurate. No longer a fingerprint.
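The accuracy column in the table above is easy to sanity-check: one degree of latitude spans roughly 111 km everywhere on Earth, so the worst-case error from rounding is half the size of the last retained digit. A quick back-of-the-envelope calculation (longitude error only shrinks further away from the equator, so this is the worst case):

```python
# One degree of latitude spans ~111,320 m; this is the worst case,
# since a degree of longitude spans less away from the equator.
METERS_PER_DEGREE_LAT = 111_320

def max_rounding_error_m(places):
    """Worst-case north-south error from rounding to `places` decimals."""
    return 0.5 * 10 ** -places * METERS_PER_DEGREE_LAT

print(round(max_rounding_error_m(4), 2))  # ~5.57 m at four decimals
print(round(max_rounding_error_m(6), 2))  # ~0.06 m at six decimals
```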
The Address Format Problem
This one is subtle but significant. Different providers format addresses differently:
| Source | Format |
|---|---|
| Provider A | 198 East Delaware Place, Chicago, IL 60611, USA |
| Provider B | 198 E Delaware Pl, Chicago, 60611 |
| Provider C | 198 E. Delaware Pl., Chicago, IL 60611 |
If you pass through addresses unchanged, then the formatting itself becomes a fingerprint. "East" vs "E" vs "E." tells you exactly where you got the data.
The solution is to normalize the data into a canonical form.
```python
# Illustrative example -- real-world address normalization should
# rely on a proper parsing library or a maintained ruleset.
import re

def normalize_address(address):
    if not address:
        return address
    # standardize directionals (East -> E, etc.)
    for word, abbr in [('East', 'E'), ('West', 'W'), ('North', 'N'), ('South', 'S')]:
        address = re.sub(rf'\b{word}\b', abbr, address, flags=re.IGNORECASE)
    # remove periods from directional abbreviations (E. -> E)
    address = re.sub(r'\b([EWNS])\.\s', r'\1 ', address)
    # standardize street types (Street -> St, etc.)
    for word, abbr in [('Street', 'St'), ('Avenue', 'Ave'), ('Place', 'Pl'), ('Boulevard', 'Blvd')]:
        address = re.sub(rf'\b{word}\b', abbr, address, flags=re.IGNORECASE)
    # remove periods from street-type abbreviations (Pl. -> Pl)
    address = re.sub(r'\b(St|Ave|Pl|Blvd)\.', r'\1', address)
    # remove ZIP+4 extension (60611-1234 -> 60611)
    address = re.sub(r'(\d{5})-\d{4}', r'\1', address)
    # remove country suffix
    address = re.sub(r',?\s*(USA|United States)\s*$', '', address, flags=re.IGNORECASE)
    return address.strip()
```
Output: 198 E Delaware Pl, Chicago, IL 60611
Now it could have come from anywhere. That is the point.
The Precision Problem
Exact numbers are fingerprints. If your source rounds prices to the nearest dollar and you return that exact dollar figure, you are leaving a fingerprint. The same goes for review counts, distances, and any other field you did not generate yourself.
The solution is to bucket everything.
```python
# Bucket widths are illustrative; tune them to your own price distribution.
BUCKET_SIZE_LOW = 10
BUCKET_SIZE_MID = 25
BUCKET_SIZE_HIGH = 50
BUCKET_SIZE_PREMIUM = 100

def obfuscate_price_bucket(price):
    if not price:
        return None
    price = float(price)
    if price < 100:
        size = BUCKET_SIZE_LOW
    elif price < 200:
        size = BUCKET_SIZE_MID
    elif price < 500:
        size = BUCKET_SIZE_HIGH
    else:
        size = BUCKET_SIZE_PREMIUM
    bucket = (int(price) // size) * size
    return f"${bucket}-{bucket + size}"
```
Before: 129
After: $125-150
Your users get useful information. You do not expose exact values. Apply the same logic to review counts, distances, hotel counts. Anything that could be cross referenced against a known dataset.
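The same bucketing idea generalizes to any count-like field with a small reusable helper. A minimal sketch (the bucket edges below are illustrative, not the ones the article prescribes):

```python
def bucket_value(value, edges):
    """Map an exact value to a coarse range label.

    `edges` is a sorted list of bucket boundaries; the exact
    input value is never echoed back in the response.
    """
    if value is None:
        return None
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"{lo}-{hi}"
    return f"{edges[-1]}+"

# A review count of 1,387 becomes the range "1000-2500"
print(bucket_value(1387, [0, 50, 250, 1000, 2500]))
```

Used this way, the exact count never leaves your system; only the range label does.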
Provenance leaks seldom come from a single field. They usually come from a bunch of weak signals. Coordinate precision alone might not identify a source; combine it with address abbreviation style, price granularity, null behavior, and field ordering, and the candidate space shrinks fast. Many weak signals add up to a strong one, which is why the defense has to be comprehensive. Normalize the coordinates but leave the addresses raw, and a determined competitor still has plenty to work with.
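To see how weak signals compound, imagine tracking which candidate providers are consistent with each observed quirk. The provider names and signal sets below are invented purely for illustration:

```python
# Hypothetical example: each signal alone is ambiguous,
# but intersecting them pins down a single source.
candidates_by_signal = {
    "coord_precision_6dp": {"ProviderA", "ProviderB", "ProviderD"},
    "directional_style_E_dot": {"ProviderB", "ProviderC", "ProviderD"},
    "zip_plus4_present": {"ProviderA", "ProviderB"},
    "price_exact_dollars": {"ProviderB", "ProviderC"},
}

suspects = set.intersection(*candidates_by_signal.values())
print(suspects)  # {'ProviderB'}
```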
How I Think About It Now
Every field in an API response is either something you generated or something you inherited. The stuff you generated or transformed is yours. The stuff you inherited usually carries fingerprints from wherever it originated.
The plagiarism detection parallel is exact. When I grade student submissions at Georgia Tech, I am not looking for identical code. I am looking for tells: unusual variable names, specific comment styles, formatting quirks that match a known source. The student thinks they disguised the work, but the fingerprints say otherwise.
Your API is doing the same thing in reverse. You think you transformed the data. The six decimal coordinates, the ZIP+4 extension, and the exact dollar amounts say otherwise.
The fix is straightforward: normalize addresses, round coordinates, bucket prices. Reduce precision to what your users actually need and nothing more. None of this degrades the product. It just stops you from advertising your supply chain in every response.
The goal is not to destroy utility. It is to remove unnecessary precision that preserves supplier-specific signatures without helping your users. Rounding coordinates to four decimals still places a hotel on a map. Bucketing prices still lets a traveler filter by budget. Normalizing addresses still gets someone to the front door. Good defense is selective degradation, not blind corruption. If the reduced precision would not change a single user decision, then the original precision was not serving your users. It was serving anyone trying to reverse-engineer your supply chain.
Before you ship, go through this checklist:

- Lower coordinate precision to the level your product actually needs.
- Reformat addresses into a canonical format that makes sense for your product.
- Bucket highly specific numeric values.
- Standardize null and default handling across all fields.
- Look for supplier-specific quirks that repeat across many fields.
- Test whether your records can still be matched back to likely sources.
If your output still looks like the source, then the source is still in the output.
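The precision items in that checklist can be partially automated. A sketch that flags fields whose serialized decimal precision exceeds what you intend to ship, assuming responses are plain dicts (field names and limits here are illustrative):

```python
from decimal import Decimal

def decimal_places(value):
    """Count decimal places as serialized, e.g. 41.899223 -> 6."""
    d = Decimal(str(value)).normalize()
    return max(0, -d.as_tuple().exponent)

def audit_precision(record, limits):
    """Return fields whose precision exceeds the intended limit."""
    return {
        field: decimal_places(record[field])
        for field, max_places in limits.items()
        if field in record and decimal_places(record[field]) > max_places
    }

response = {"latitude": 41.899223, "longitude": -87.6222, "price_per_night": 129}
print(audit_precision(response, {"latitude": 4, "longitude": 4}))
# {'latitude': 6}
```

Run this against a sample of real responses before each release, and over-precise fields show up immediately.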
What Comes Next
Removing inbound fingerprints is a defensive measure that protects your sources, but it also has an offensive application.
With paying customers using unique API keys, you can reverse the approach by adding deterministic watermarks that trace each response to its recipient. If your data appears on a competitor's platform later, the watermark identifies the source of the leak.
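As a preview of the idea, one generic approach is keyed jitter (this is my sketch of the concept, not Tripvento's actual implementation): derive a tiny, reproducible perturbation from an HMAC of the customer's API key, so each customer's responses can differ in the last retained digit while the same customer always sees the same value.

```python
import hashlib
import hmac

SECRET = b"server-side-watermark-secret"  # hypothetical server-side key

def watermark_coord(value, api_key, places=4):
    """Nudge the last retained digit deterministically per API key."""
    digest = hmac.new(SECRET, f"{api_key}:{value}".encode(), hashlib.sha256).digest()
    nudge = (digest[0] % 3 - 1) * 10 ** -places  # -1, 0, or +1 in the last digit
    return round(round(value, places) + nudge, places)

# Same key always yields the same coordinate; different keys may differ.
print(watermark_coord(41.899223, "customer-key-1"))
```

The nudge stays within the precision you already decided to ship, so it costs nothing in utility.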
I will cover the full watermarking implementation in the next post. I am also building a tool that automates fingerprint detection and obfuscation across API responses. More on that soon.
I'm Ioan Istrate, founder of Tripvento - a hotel ranking API that scores properties against 14 traveler personas using geospatial intelligence and semantic AI. Previously worked on ranking systems at U.S. News & World Report. If you want to talk about data provenance, supply chain obfuscation, or API fingerprinting, let's connect on LinkedIn.
This is part 7 of the Building Tripvento series. Part 1 covered deleting 55M rows with PostGIS. Part 2 covered the multi LLM self healing data pipeline. Part 3 covered the Django performance audit. Part 4 covered zero public ports and API security. Part 5 covered the pSEO content factory. Part 6 covered prompt injection, steganography tools, and the LLM honeypot.