Ioan G. Istrate

Posted on Apr 1 • Originally published at blog.tripvento.com

How I Fingerprint My Own API to Catch Scrapers

#django #security #api #python

TL;DR: Once you've stripped fingerprints from your data sources (Part 7), flip the script. Add your own watermarks so you can trace leaks back to specific customers. Coordinate jitter, price bucket skew, phantom records, and invisible text markers. All deterministic, all traceable, all invisible to users.

In Part 7, I discussed how to remove inbound fingerprints from your API responses. This includes things such as coordinates, addresses, pricing, etc.

This was defense.

This is offense.

Now that you have paying customers, each with a unique API key, you can add a watermark to each API response that will allow you to track who is using your information. Want to know who's selling your data on a competitor's site after six months? Well, you'll know.

These are the same techniques that catch plagiarizers, that Google Maps uses to catch copycats, and that encyclopedias use to catch thieves.

Here are some ideas on how you can implement them, with one important caveat: the techniques have to be non-destructive, and the watermarks must survive the reasonable downstream transformations you expect customers to apply.

The Concept: Deterministic Watermarks

The key insight is that watermarks must be:

Invisible — They aren't visible to the user
Deterministic — Same input + same API key = same watermark
Unique — Different API keys should yield different watermarks
Verifiable — Prove that the leak came from one customer

If Customer A's data is showing up somewhere it shouldn't, you can hash their API key with the original values and verify the watermark.

Technique 1: Coordinate Jitter

This is the highest signal, lowest effort watermark. Add deterministic noise to coordinates based on the customer's API key.

import hashlib

def watermark_location(lat, lng, api_key):
    """
    Add deterministic jitter to coordinates.
    Offset of up to ~33 m per axis, unique per customer, invisible on maps.
    """
    seed = f"{api_key}:{lat}:{lng}".encode()
    h = hashlib.sha256(seed).digest()

    # Map bytes to a bounded jitter range
    lat_jitter = (int.from_bytes(h[:4], "big") % 600 - 300) / 1_000_000
    lng_jitter = (int.from_bytes(h[4:8], "big") % 600 - 300) / 1_000_000

    return lat + lat_jitter, lng + lng_jitter

Customer A sees: 41.8997, -87.6220

Customer B sees: 41.8999, -87.6222

Both are within a few tens of meters of the true location. Both work perfectly for mapping. They are different, however, and that difference is deterministic.
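To sanity-check the size of the offset, the `% 600 - 300` range in `watermark_location` can be converted to meters. This is a rough sketch that assumes a spherical Earth with about 111,320 m per degree of latitude; `max_jitter_meters` is a hypothetical helper, not part of the original code.

```python
import math

METERS_PER_DEGREE_LAT = 111_320       # approximate, spherical-Earth assumption
MAX_JITTER_DEGREES = 300 / 1_000_000  # largest magnitude produced by `% 600 - 300`


def max_jitter_meters(lat):
    """Return the worst-case (north-south, east-west) offset in meters at a latitude."""
    north_south = MAX_JITTER_DEGREES * METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west


ns, ew = max_jitter_meters(41.8997)
print(f"north-south: up to {ns:.1f} m, east-west: up to {ew:.1f} m")
```

In Chicago that comes to roughly 33 m north-south and 25 m east-west at most, which is small enough to stay on the same block in most cities but far larger than floating point noise.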

Verification

If you suspect a leak, take the coordinates from the leaked data and verify:

def verify_watermark(leaked_lat, leaked_lng, original_lat, original_lng, suspect_api_key):
    """Check if leaked coordinates match a specific customer's watermark."""
    expected_lat, expected_lng = watermark_location(original_lat, original_lng, suspect_api_key)

    # Allow small tolerance for floating point
    lat_match = abs(leaked_lat - expected_lat) < 0.00001
    lng_match = abs(leaked_lng - expected_lng) < 0.00001

    return lat_match and lng_match

If it matches, you've identified the source of the leak.
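In practice you rarely have a single suspect, and leaked coordinates are often rounded to fewer decimal places than you served. The sketch below tries every candidate key and compares at the precision of the leaked data. `find_matching_keys` and its `decimals` parameter are hypothetical names, and the sketch assumes the leak was rounded rather than otherwise transformed.

```python
import hashlib


def watermark_location(lat, lng, api_key):
    """Same jitter function as above, repeated so this sketch runs on its own."""
    h = hashlib.sha256(f"{api_key}:{lat}:{lng}".encode()).digest()
    lat_jitter = (int.from_bytes(h[:4], "big") % 600 - 300) / 1_000_000
    lng_jitter = (int.from_bytes(h[4:8], "big") % 600 - 300) / 1_000_000
    return lat + lat_jitter, lng + lng_jitter


def find_matching_keys(leaked, original, candidate_keys, decimals=6):
    """Return every candidate key whose watermark matches the leaked point.

    `decimals` is the precision of the leaked data; both sides are rounded
    to it before comparing, so a copy rounded to 5 places can still match.
    """
    leaked_lat, leaked_lng = leaked
    matches = []
    for key in candidate_keys:
        exp_lat, exp_lng = watermark_location(*original, key)
        if (round(exp_lat, decimals) == round(leaked_lat, decimals)
                and round(exp_lng, decimals) == round(leaked_lng, decimals)):
            matches.append(key)
    return matches


# Hypothetical keys; the "leak" here is simply what customer B was served.
keys = ["key_customer_a", "key_customer_b", "key_customer_c"]
original = (41.8997, -87.6220)
leaked = watermark_location(*original, "key_customer_b")
print(find_matching_keys(leaked, original, keys))
```

The coarser the leaked precision, the more keys will match any single point by chance, so it is safer to run this over many records and keep only the keys that match consistently.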

Technique 2: Price Bucket Skew

In Part 7, I covered how you can bucket prices to remove fingerprints ($127 → $125-150). You can now flip this around and extend it by shifting bucket boundaries per customer.

def watermark_price_bucket(price, api_key):
    """
    Shift bucket boundaries slightly per customer.
    Same price, different bucket = traceable.
    """
    # Deterministic offset from API key (-2 to +2 dollars)
    offset = int(hashlib.sha256(api_key.encode()).hexdigest()[:4], 16) % 5 - 2
    # Nudging the price by `offset` has the same effect as moving every
    # bucket boundary by -offset.
    adjusted_price = price + offset
    # obfuscate_price_bucket is the bucketing helper from Part 7
    return obfuscate_price_bucket(adjusted_price)

Customer A: $123 → "$120-145"

Customer B: $123 → "$125-150"

Same hotel, same underlying price, different bucket. If someone's reselling your data, the bucket boundaries will match one of your customers.

Only apply bucket skew where prices are already presented as approximate ranges, not where customers expect cross-account consistency.
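The snippet above nudges the price before bucketing. If the Part 7 helper uses a fixed grid, that changes which bucket a price lands in but not the labels themselves. To make the labels differ per customer, the grid itself has to move. Here is a minimal sketch of that variant, assuming 25-dollar buckets and offsets in 5-dollar steps so labels still look round. `BUCKET_WIDTH`, `key_offset`, `skewed_price_bucket`, and `offset_from_label` are all hypothetical names, and the sketch assumes prices well above one bucket width.

```python
import hashlib

BUCKET_WIDTH = 25  # assumed bucket size; the real helper from Part 7 may differ
OFFSET_STEP = 5    # assumed step, so shifted labels still end in 0 or 5


def key_offset(api_key):
    """Deterministic per-customer grid offset: -10, -5, 0, 5, or 10 dollars."""
    raw = int(hashlib.sha256(api_key.encode()).hexdigest()[:4], 16) % 5 - 2
    return raw * OFFSET_STEP


def skewed_price_bucket(price, api_key, width=BUCKET_WIDTH):
    """Bucket a price on a grid shifted by the customer's offset."""
    offset = key_offset(api_key)
    # Floor onto the shifted grid: boundaries sit at offset, offset + width, ...
    low = (int(price) - offset) // width * width + offset
    return f"${low}-{low + width}"


def offset_from_label(label, width=BUCKET_WIDTH):
    """Recover the grid offset from a leaked bucket label such as "$120-145"."""
    low = int(label.lstrip("$").split("-")[0])
    shift = low % width
    # Shifts past the midpoint are negative offsets (e.g. 20 means -5).
    return shift - width if shift > width // 2 else shift
```

With this layout, a leaked label is enough to recover the offset without knowing the underlying price. Five offsets only split customers into five groups, so this signal narrows the field rather than naming one account.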

Why This Works

The boundaries of price buckets seem completely arbitrary to end users. No one ever thinks, "You know what would make sense? If the bucket stopped at $125 instead of $120." However, when looking across thousands of records, the pattern becomes unmistakable. If a competitor's data lines up with the bucket boundaries of one of your customers, namely Customer B, then that's not a coincidence.

Technique 3: Phantom Records

Google Maps, for example, includes "trap streets" that exist only in the Google database. If another company's map also includes the same trap street, then they must be copying.

Encyclopedias have used the same strategy with fake entries, now known as "Mountweazels." The name comes from Lillian Virginia Mountweazel, a fictional fountain designer turned photographer who appeared in the 1975 New Columbia Encyclopedia.

The same strategy can be used with phantom records.

PHANTOM_HOTELS = {
    'chicago': {
        # Internal label only; expose an ID in the same format as real records
        'id': 'phantom_chi_001',
        'name': 'The Lakefront Inn & Suites',
        'latitude': 41.8819,
        'longitude': -87.6278,
        'price': '$150-175',
        'rating': 4.5,
        'address': '1847 N Lake Shore Dr, Chicago, IL'
    },
    'new_york': {
        'id': 'phantom_nyc_001',
        'name': 'Hudson River Boutique Hotel',
        'latitude': 40.7589,
        'longitude': -74.0012,
        'price': '$200-250',
        'rating': 4.3,
        'address': '847 W 42nd St, New York, NY'
    }
}

These hotels don't exist. They look real. They have real-sounding names like "The Lakefront Inn & Suites" instead of "Test Hotel 123." They have plausible coordinates, meaning a real place on a map where a hotel could exist. They have plausible pricing, meaning they charge what you'd expect in a neighborhood like that.

Making Phantoms Believable

The key is making phantom records indistinguishable from real data:

Realistic names — "The Lakefront Inn & Suites" not "Test Hotel 123"
Plausible coordinates — Real location where a hotel could exist
Consistent pricing — Matches the neighborhood's typical range
Complete data — All fields populated, no obvious gaps
Stable over time — Don't change phantoms frequently

The only thing that makes a phantom record detectable is that you know it's fake and no one else does.

Per-Customer Phantoms

For extra traceability, inject different phantom records for different customers:

def get_phantom_for_customer(city, api_key):
    """Return a customer-specific phantom hotel."""
    # Use API key to deterministically select which phantom variant
    # PHANTOM_VARIANTS maps each city to a list of three phantom dicts
    variant = int(hashlib.sha256(api_key.encode()).hexdigest()[:2], 16) % 3
    return PHANTOM_VARIANTS[city][variant]

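Now if a phantom appears in the wild, you know which variant leaked and therefore which group of customers received it. With only three variants this narrows the search rather than naming a single customer; for exact attribution, use one variant per customer.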
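`PHANTOM_VARIANTS` is not defined in this post, so here is one possible shape for it, plus a reverse lookup that tells you which customers would have received a given phantom. The hotel names and the `customers_by_phantom` helper are made up for illustration.

```python
import hashlib
from collections import defaultdict

# Hypothetical data: a few interchangeable phantoms per city.
PHANTOM_VARIANTS = {
    "chicago": [
        {"name": "The Lakefront Inn & Suites"},
        {"name": "Harborview Lodge Chicago"},
        {"name": "North Shore Garden Hotel"},
    ],
}


def variant_index(api_key, n_variants):
    """Deterministically map an API key to a variant slot."""
    return int(hashlib.sha256(api_key.encode()).hexdigest()[:2], 16) % n_variants


def get_phantom_for_customer(city, api_key):
    variants = PHANTOM_VARIANTS[city]
    return variants[variant_index(api_key, len(variants))]


def customers_by_phantom(city, api_keys):
    """Map each phantom name to the API keys that would have received it."""
    groups = defaultdict(list)
    for key in api_keys:
        groups[get_phantom_for_customer(city, key)["name"]].append(key)
    return dict(groups)
```

When a phantom turns up on another site, `customers_by_phantom` gives you the shortlist to run through the other watermark checks.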

Technique 4: Invisible Text Markers

If your API returns text fields (descriptions, summaries, AI-generated content), you can embed invisible markers using zero-width Unicode characters. That said, some platforms normalize or strip zero-width characters, so text watermarks should be treated as a high-value signal, not guaranteed proof.

import hashlib

ZW0 = "\u200B"  # binary 0
ZW1 = "\u200C"  # binary 1

def watermark_text(text, api_key):
    """
    Embed an invisible, deterministic fingerprint into text.
    """
    digest = hashlib.sha256(api_key.encode()).hexdigest()
    fingerprint = int(digest[:4], 16)  # 16‑bit stable fingerprint

    bits = format(fingerprint, "016b")
    marker = "".join(ZW0 if b == "0" else ZW1 for b in bits)

    if ". " in text:
        return text.replace(". ", f". {marker}", 1)
    return text + marker

The text looks identical to humans:

"Located in downtown Chicago, this hotel offers stunning lake views. Guests enjoy the rooftop bar and fitness center."

But the underlying string contains your watermark:

"Located in downtown Chicago, this hotel offers stunning lake views. [invisible: 0100110101011010]Guests enjoy the rooftop bar and fitness center."

Detection

def extract_watermark(text):
    bits = []
    for ch in text:
        if ch == ZW0:
            bits.append("0")
        elif ch == ZW1:
            bits.append("1")
    if len(bits) >= 16:
        return int("".join(bits[:16]), 2)
    return None

def identify_source(text, api_keys):
    extracted = extract_watermark(text)
    if extracted is None:
        return None

    for key in api_keys:
        digest = hashlib.sha256(key.encode()).hexdigest()
        if int(digest[:4], 16) == extracted:
            return key
    return None
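A quick round trip helps confirm that embedding and extraction agree. This compact sketch reuses the same 16-bit scheme but appends the marker at the end for brevity, and it returns every key that shares the fingerprint instead of only the first. `key_fingerprint`, `embed`, `extract`, and `matching_keys` are hypothetical names for this example.

```python
import hashlib

ZW0, ZW1 = "\u200B", "\u200C"  # zero-width space = 0, zero-width non-joiner = 1


def key_fingerprint(api_key):
    """16-bit fingerprint, derived the same way as in watermark_text."""
    return int(hashlib.sha256(api_key.encode()).hexdigest()[:4], 16)


def embed(text, api_key):
    bits = format(key_fingerprint(api_key), "016b")
    return text + "".join(ZW0 if b == "0" else ZW1 for b in bits)


def extract(text):
    bits = "".join("0" if ch == ZW0 else "1" for ch in text if ch in (ZW0, ZW1))
    return int(bits[:16], 2) if len(bits) >= 16 else None


def matching_keys(text, api_keys):
    """Return all keys sharing the extracted fingerprint, not just the first."""
    found = extract(text)
    return [k for k in api_keys if found is not None and key_fingerprint(k) == found]


keys = ["key_customer_a", "key_customer_b"]
marked = embed("Stunning lake views.", "key_customer_a")
print(len(marked) - len("Stunning lake views."))        # 16 invisible characters added
print("key_customer_a" in matching_keys(marked, keys))  # True
```

A 16-bit fingerprint has only 65,536 possible values, so two customers can collide once the customer base grows. Returning all matches, or widening the fingerprint, avoids blaming the wrong account.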

I built free tools to encode, decode, scan, and strip these invisible characters at tripvento.com/tools/zwsteg. There's also a homoglyph detector for catching Cyrillic lookalike characters. Both run client-side with nothing sent to any server.

Technique 5: Response Metadata

Sometimes the best security is letting people know you're watching.

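import hashlib
from datetime import datetime, timezone

def add_response_metadata(data, api_key, request_id):
    """Add tracking metadata to response."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC time
    generated_at = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
    return {
        "data": data,
        "meta": {
            "request_id": request_id,
            "key_fingerprint": hashlib.sha256(api_key.encode()).hexdigest()[:8],
            "generated_at": generated_at,
            # get_customer_name is assumed to look up the account name for a key
            "license": f"Data licensed to {get_customer_name(api_key)}. Redistribution prohibited."
        }
    }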

It doesn't stop anything technically. A determined scraper will find a way to remove the metadata. But it does say: We are tracking this. We know who you are. We are paying attention.

It's the same reason why schools warn students that work will be scanned for plagiarism. The software is important. The warning is even more important. Most people won't steal if they think they'll be caught.

Implementation Strategy

When to Apply What

Technique	Demo/Public	Paid Customers
Coordinate jitter	❌ No	✅ Yes
Price bucket skew	❌ No	✅ Yes
Phantom records	❌ No	✅ Yes
Text watermarks	❌ No	✅ Yes
Response metadata	Optional	✅ Yes

Public/demo data doesn't need watermarks — there's no one to trace. Watermarking only makes sense when you have identifiable customers with unique API keys.

Integration Point

Add watermarking at the serializer level, after obfuscation but before response:

from rest_framework import serializers

# Meta (model, fields) omitted for brevity
class HotelSerializer(serializers.ModelSerializer):
    location = serializers.SerializerMethodField()

    def get_location(self, obj):
        # Step 1: Obfuscate (strip source fingerprints)
        lat, lng = obfuscate_location(obj.latitude, obj.longitude)

        # Step 2: Watermark (add our fingerprints) - only for paid tiers
        api_key = self.context.get('api_key')
        tier = self.context.get('tier', 'demo')

        if tier != 'demo' and api_key:
            lat, lng = watermark_location(lat, lng, api_key)

        return {'latitude': lat, 'longitude': lng}

Logging for Verification

Keep a log of what you sent to whom:

def log_response(api_key, request_id, hotel_ids, timestamp):
    """Log response for future verification."""
    ResponseLog.objects.create(
        api_key_hash=hash_key(api_key),
        request_id=request_id,
        hotel_ids=hotel_ids,
        timestamp=timestamp,
        # Store original values for watermark verification
        original_coords=get_original_coords(hotel_ids)
    )

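Verification assumes you retain the exact coordinates that were passed to watermark_location. In the serializer above those are the obfuscated values, so either store them or make sure the obfuscation can be replayed deterministically from the canonical coordinates.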

When investigating a suspected leak, you can reconstruct exactly what watermarks that customer should have received.

The Detection Workflow

When you suspect data theft:

Collect samples — Get coordinates, prices, text from the suspected copy
Identify candidates — Which customers had access to this data?
Verify watermarks — Run each customer's API key through verification
Check phantoms — Are any of your phantom records present?
Extract text markers — Scan for zero width character fingerprints
Document evidence — Screenshot everything, log the verification results

If multiple watermarking techniques point to the same customer, you have strong evidence.
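One simple way to combine the results is to count how many independent techniques point at each customer. The sketch below assumes you have already run each check and collected the matching customer IDs per technique; `rank_suspects` and the sample IDs are hypothetical.

```python
from collections import Counter


def rank_suspects(signals):
    """Rank customers by how many independent techniques point at them.

    `signals` maps a technique name to the set of customer IDs it matched.
    """
    counts = Counter()
    for matched_customers in signals.values():
        # Each technique counts at most once per customer.
        counts.update(set(matched_customers))
    return counts.most_common()


signals = {
    "coordinates": {"cust_b"},
    "price_buckets": {"cust_b", "cust_d"},
    "phantoms": {"cust_b", "cust_c"},
    "text_markers": set(),  # e.g. zero-width characters were stripped
}
print(rank_suspects(signals)[0])  # ('cust_b', 3)
```

A plain count treats every technique as equally reliable. In practice you might weight a coordinate match across hundreds of records more heavily than a single shared price bucket.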

Threat Model & Practical Limits

These watermarking techniques are designed to detect unauthorized reuse by lazy to moderately sophisticated actors — not a fully adversarial opponent with complete control over the data pipeline.

What This System Catches Well

Direct scraping and republishing
Naïve resale of API responses
Competitors ingesting data without normalization
Long-term aggregation and mirroring

What It Does Not Guarantee

Survival through aggressive data cleaning
Survival through manual rewriting
Attribution after intentional, expert-level laundering
Protection against customers who fully re-derive facts independently

Watermarking is therefore about accumulating evidence, not delivering a binary verdict. A single signal may fail; multiple independent signals converging on the same customer rarely do.

This is why techniques are stacked:

Coordinates + prices + text + phantoms
Deterministic but heterogeneous
Robust across different transformation paths

The goal is not perfect prevention. The goal is credible, defensible attribution.

Legal, Ethical, and Product Constraints

Watermarking should never compromise user trust, factual correctness, or legal safety.

Required Guardrails

1. No User-Facing Deception

Phantom records must never be:

Searchable by end users
Bookable or actionable
Indexed by public crawlers

They exist solely as internal honeypots.

2. No Material Misrepresentation

Apply price skewing only where prices are already approximate
Never alter fields that customers treat as exact or contractual

3. Attribution, Not Entrapment

Watermarks are for identifying misuse, not tricking users
Metadata warnings should be accurate and proportional

4. Jurisdiction Awareness

Laws governing data attribution, disclosure, and deceptive practices vary
Watermarking strategies should be reviewed alongside terms of service and local regulations

In short: Watermarking protects output without lying about reality. If a technique would confuse or mislead a good-faith customer, it should not be used.

The Takeaway

Defensive obfuscation protects your sources. Offensive watermarking protects your output.

Coordinate jitter — Invisible, deterministic, highest signal
Price bucket skew — Subtle, survives transformation
Phantom records — Honeypots that prove copying
Text watermarks — Invisible Unicode fingerprints
Response metadata — Overt deterrent

The same methods that catch plagiarizers will also catch those who misuse your API. You just have to think like both sides of the equation: the side trying to steal the information, and the side trying to catch them stealing it.

Years of catching students who thought they were so smart have shown me that it is not the smart ones who are the problem. It is the ones who are too lazy to think about it. Make it obvious that you are paying attention, and most problems will solve themselves.
